Training a Swedish NER-model for Stanford CoreNLP part 1

Preparing the training data

Andreas Klintberg
5 min read · Apr 8, 2015

This post will be a bit longer than the last ones, and it is part 1 of 2 on how to train an NER model for CoreNLP.

Training the model essentially consisted of the following five steps; this post covers steps 1–4:

  1. Getting training data with manually annotated proper nouns, but with no classification of whether they were PER, ORG or LOC.
  2. Formatting the training data to the Stanford NER training data format.
  3. Getting gazetteers for locations, persons and organizations, by scraping Swedish locations and Swedish names.
  4. Scripted lookup of each proper noun to see whether a gazetteer can classify it, an automatic annotation of sorts.
  5. Training the model with the automatically annotated data (part 2).

1. Training data

I got the training data for Swedish from http://spraakbanken.gu.se/swe/resurser, which has a range of corpora within different areas. Most are free downloads, but some have different licenses, so make sure to take a quick peek at those before downloading.

The format is as follows (a small excerpt):

<corpus id="webbnyheter2001">
<text title="Stammisliv" url="http://www.dn.se/Pages/Article.aspx?id=858249&amp;epslanguage=sv" newspaper="Dagens Nyheter" date="2001-03-25 13:11:00" datefrom="20010325" dateto="20010325" timefrom="131100" timeto="131100">
<sentence id="69f65278-69545d84">
<w pos="DT" msd="DT.UTR.SIN.IND" lemma="|en|" lex="|en..al.1|" saldo="|den..1|en..2|" prefix="|" suffix="|" ref="01" dephead="02" deprel="DT">En</w>
<w pos="NN" msd="NN.NEU.SIN.IND.GEN" lemma="|" lex="|" saldo="|" prefix="|semester..nn.1|" suffix="|resenär..nn.1|" ref="02" dephead="03" deprel="DT">semesterresenärs</w>
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|agenda|" lex="|agenda..nn.1|" saldo="|agenda..1|" prefix="|" suffix="|" ref="03" dephead="04" deprel="SS">agenda</w>
<w pos="VB" msd="VB.PRS.AKT" lemma="|vara|" lex="|vara..vb.1|" saldo="|vara..1|" prefix="|" suffix="|" ref="04" dephead="19" deprel="MS">är</w>
<w pos="AB" msd="AB.SUV" lemma="|ofta|" lex="|ofta..ab.1|" saldo="|ofta..1|" prefix="|" suffix="|" ref="05" dephead="04" deprel="TA">oftast</w>
<w pos="JJ" msd="JJ.POS.UTR.SIN.IND.NOM" lemma="|full|" lex="|full..av.1|" saldo="|full..1|full..2|" prefix="|" suffix="|" ref="06" dephead="04" deprel="SP">full</w>
<w pos="PP" msd="PP" lemma="|av|" lex="|av..pp.1|" saldo="|av..1|" prefix="|" suffix="|" ref="07" dephead="04" deprel="AG">av</w>
<w pos="NN" msd="NN.NEU.PLU.IND.NOM" lemma="|tips|" lex="|tips..nn.1|" saldo="|tips..1|tips..2|" prefix="|" suffix="|" ref="08" dephead="07" deprel="PA">tips</w>
<w pos="PP" msd="PP" lemma="|från|" lex="|från..pp.1|" saldo="|från..1|" prefix="|" suffix="|" ref="09" dephead="08" deprel="ET">från</w>
<w pos="JJ" msd="JJ.POS.UTR+NEU.PLU.IND+DEF.NOM" lemma="|bekant|" lex="|bekant..av.1|" saldo="|bekant..1|" prefix="|" suffix="|" ref="10" dephead="09" deprel="PT">bekanta</w>
<w pos="PP" msd="PP" lemma="|på|" lex="|på..pp.1|" saldo="|på..1|" prefix="|" suffix="|" ref="11" dephead="10" deprel="ET">på</w>
<w pos="NN" msd="NN.UTR.PLU.IND.NOM" lemma="|sak|" lex="|sak..nn.1|" saldo="|sak..1|sak..2|sak..3|" prefix="|" suffix="|" ref="12" dephead="11" deprel="HD">saker</w>
<w pos="PN" msd="PN.UTR.SIN.IND.SUB" lemma="|man|" lex="|man..pn.1|" saldo="|man..1|" prefix="|" suffix="|" ref="13" dephead="14" deprel="SS">man</w>
<w pos="VB" msd="VB.PRS.AKT" lemma="|måste|" lex="|måste..vb.1|" saldo="|måste..1|" prefix="|" suffix="|" ref="14" dephead="11" deprel="UA">måste</w>
<w pos="VB" msd="VB.INF.AKT" lemma="|se|" lex="|se..vb.1|" saldo="|se..1|se..2|se..3|se..4|se..5|se..6|ses..1|" prefix="|" suffix="|" ref="15" dephead="14" deprel="VG">se</w>
<w pos="MID" msd="MID" lemma="|" lex="|" saldo="|" prefix="|" suffix="|" ref="16" dephead="04" deprel="IK">,</w>
<w pos="VB" msd="VB.PRS.AKT" lemma="|måste|" lex="|måste..vb.1|" saldo="|måste..1|" prefix="|" suffix="|" ref="17" dephead="04" deprel="+F">måste</w>
<w pos="VB" msd="VB.INF.AKT" lemma="|göra|" lex="|göra..vb.1|" saldo="|göra..1|göra..2|göra..3|" prefix="|" suffix="|" ref="18" dephead="17" deprel="VG">göra</w>
<w pos="KN" msd="KN" lemma="|och|" lex="|och..kn.1|" saldo="|och..1|" prefix="|" suffix="|" ref="19" dephead="" deprel="ROOT">och</w>
<w pos="VB" msd="VB.PRS.AKT" lemma="|måste|" lex="|måste..vb.1|" saldo="|måste..1|" prefix="|" suffix="|" ref="20" dephead="19" deprel="+F">måste</w>
<w pos="VB" msd="VB.INF.AKT" lemma="|uppleva|" lex="|uppleva..vb.1|" saldo="|uppleva..1|" prefix="|upp..ab.1|" suffix="|leva..vb.1|" ref="21" dephead="20" deprel="VG">uppleva</w>
<w pos="MAD" msd="MAD" lemma="|" lex="|" saldo="|" prefix="|" suffix="|" ref="22" dephead="19" deprel="IP">.</w>
</sentence>
</text>

So, a bit of an explanation: it is XML, starting with some information about the corpus and where it was collected from. <sentence> marks the start of a sentence and </sentence> its end. <w …> is a word, and there is one <w…> element for each word in the sentence. Each word carries a range of attributes: pos is the POS tag, lemma is the lemmatized form of the word, and so on.

In some of the corpora the sentences are scrambled, i.e. the order of the sentences is shuffled, because of copyright issues.
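To get a feel for the format, here is a minimal sketch using Python's built-in ElementTree that pulls out the token and POS tag for each word (the tiny inline sample is hypothetical, with most attributes trimmed for brevity):

```python
import xml.etree.ElementTree as ET

# A tiny two-word sentence in the Språkbanken style (attributes trimmed)
xml_data = """<corpus id="webbnyheter2001">
<text title="Stammisliv">
<sentence id="69f65278-69545d84">
<w pos="DT" lemma="|en|" ref="01">En</w>
<w pos="NN" lemma="|agenda|" ref="02">agenda</w>
</sentence>
</text>
</corpus>"""

root = ET.fromstring(xml_data)
for sentence in root.iter("sentence"):
    # (token, POS tag) pairs for every <w> in the sentence
    tokens = [(w.text, w.get("pos")) for w in sentence.iter("w")]
    print(tokens)
# → [('En', 'DT'), ('agenda', 'NN')]
```

The real corpus files are large, so for those you would stream with ET.iterparse instead of loading everything at once, but the element access is the same.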

2. Formatting the data for CoreNLP NER

The data format of Stanford CoreNLP NER is as follows:

I	O
complained	O
to	O
Microsoft	ORGANIZATION
about	O
Bill	PERSON
Gates	PERSON
.	O

They	O
told	O
me	O
to	O
see	O
the	O
mayor	O
of	O
New	LOCATION
York	LOCATION
.	O

A tab-separated file, where the first column is the word and the second column is its label; if a word is not part of an entity it simply gets the background label (O in the example above; I use 0 in my files below). I've chosen to use four different labels:

  • PER (Persons)
  • ORG (Organizations)
  • LOC (Locations)
  • MISC (everything else, e.g. products)

So I had to write a formatting script to get from the XML to the new training data format. Unfortunately the training data is not NER annotated, which means we only know whether a word is a proper noun, not what category/label it has (we will take care of this in the next step). The code is at https://github.com/klintan/corpusxml2corenlp/blob/master/xml2corenlp.py. The script does the transformation, although it puts "LABEL" on each proper noun, which will need to be replaced with the correct category in the next step.
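In essence, the transformation does something like the following. This is a minimal sketch, not the linked script; it assumes proper nouns carry the SUC POS tag PM, which is what the Språkbanken corpora use:

```python
import xml.etree.ElementTree as ET

def xml_to_corenlp(xml_string):
    """Convert Språkbanken sentence XML to CoreNLP training lines.
    Proper nouns (pos="PM") get the placeholder LABEL, everything else 0."""
    root = ET.fromstring(xml_string)
    lines = []
    for sentence in root.iter("sentence"):
        for w in sentence.iter("w"):
            label = "LABEL" if w.get("pos") == "PM" else "0"
            lines.append(f"{w.text}\t{label}")
        lines.append("")  # blank line separates sentences
    return "\n".join(lines)

# Hypothetical three-word sentence
sample = ('<sentence id="x">'
          '<w pos="PM">Gunhild</w><w pos="PM">Westman</w>'
          '<w pos="VB">anser</w></sentence>')
print(xml_to_corenlp(sample))
# → Gunhild	LABEL
#   Westman	LABEL
#   anser	0
```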

3. Getting gazetteer data

To classify the "LABEL" now assigned to every proper noun, looking something like this:

och 0
den 0
andra 0
i 0
den 0
stora 0
. 0
Gunhild LABEL
Westman LABEL
, 0
tidigare 0
lektor 0
i 0
pedagogik 0
vid 0
Uppsala LABEL
universitet 0
och 0
som 0
forskat 0
på 0
barns 0
lek 0
, 0
anser 0
att 0
naturlig 0
lek 0
är 0

we need to get some gazetteers to "automatically" classify each entity with a label. I would have loved to share what I scraped for others to use, but the legality of scraping names and locations is a bit of a grey area, I guess. However, I got lots of first names and surnames, mostly Swedish.

Furthermore, I got all the locations in Sweden, all roads and all cities. I also got hold of some organisations, but only a very small subset of Swedish companies/organisations. All in all I ended up with ~160 000 entities in my database.

The scraping was done using BeautifulSoup in Python; there are lots of tutorials out there on how to scrape sites, so I will omit that part here.

4. Classifying the “LABEL” automatically using our gazetteers

Simply put, we want to check every word that has "LABEL" assigned to it (i.e. is a proper noun according to our original training data) against the gazetteer to see which label it should have. If the word also exists in the gazetteer it gets that label; otherwise it keeps "LABEL".
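The lookup can be sketched like this, assuming the gazetteer has already been loaded into a plain dict mapping names to labels (my actual script, linked below, reads from a database):

```python
def apply_gazetteer(lines, gazetteer):
    """Replace the LABEL placeholder with the gazetteer label when the
    token is found; keep LABEL otherwise."""
    out = []
    for line in lines:
        if not line.strip():          # keep blank sentence separators
            out.append(line)
            continue
        token, label = line.split("\t")
        if label == "LABEL" and token in gazetteer:
            label = gazetteer[token]
        out.append(f"{token}\t{label}")
    return out

# Hypothetical toy gazetteer: "Westman" is known, "Gunhild" is not
gazetteer = {"Westman": "PER", "Uppsala": "LOC"}
lines = ["Gunhild\tLABEL", "Westman\tLABEL", ",\t0"]
print(apply_gazetteer(lines, gazetteer))
# → ['Gunhild\tLABEL', 'Westman\tPER', ',\t0']
```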

If, for instance, we initially have

. 0
Gunhild LABEL
Westman LABEL
, 0
tidigare 0
lektor 0

the resulting training file will (in a perfect world) look like this:

. 0
Gunhild PER
Westman PER
, 0
tidigare 0
lektor 0

This assumes that both "Gunhild" and "Westman" exist in the database and are labeled as PER.

In my gazetteer database, “Gunhild” did not exist but “Westman” did, so I ended up with:

. 0
Gunhild LABEL
Westman PER
, 0
tidigare 0
lektor 0

The script for doing this is available here: https://github.com/klintan/gazetteer2trainingdata.

To increase the quality of the gazetteers, or rather the coverage of the labels, I will manually add missing labels from the training file to the database. I can then rerun the script to fill in the previously unlabeled examples.

That is it for part 1; the next part will use this data to train the NER model for Swedish. Part 2 will also (probably) include a method/script to make sure we only use sentences from the training data that are fully labeled, i.e. no sentences that still carry the label "LABEL".
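As a preview, such a filter could be as simple as dropping every sentence block that still contains the "LABEL" placeholder (a sketch, assuming the blank-line-separated, tab-delimited format above):

```python
def fully_labeled_sentences(text):
    """Keep only sentences (blank-line-separated blocks) that have
    no remaining LABEL placeholder."""
    blocks = text.strip().split("\n\n")
    kept = [b for b in blocks if "\tLABEL" not in b]
    return "\n\n".join(kept)

data = "Gunhild\tLABEL\nWestman\tPER\n\nUppsala\tLOC\nuniversitet\t0"
print(fully_labeled_sentences(data))
# → only the fully labeled "Uppsala universitet" sentence remains
```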

For more gazetteers:

https://github.com/klintan/swedish-gazetteers — A range of gazetteers that I’ve collected.

A large number of cities:

I work for the company Meltwater, where we do things like this on a daily basis; feel free to check it out: www.meltwater.com
