Notes on German Words

Lemmatisation and Frequency Bands

David Rosson
Linguistic Curiosities
Jul 18, 2019

Lemmatisation with Morphy

The same word may take many forms. Finding the lemma means reducing an inflected form to its citation form: if we want to look up the meaning of ‘counting ducks’, we read the dictionary entries for ‘count’ and ‘duck’.

For German, ‘Morphy’ is a great database for this task. Here I’m using a simplified mapping file; the complete version is much larger and provides detailed grammatical tagging information. The simplified index just maps a more inflected form to a less inflected form. Note that the mapping is recursive: a lookup can return a form that is itself still inflected, so it may need to be repeated until a base form is reached.
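A minimal sketch of that repeated lookup, assuming the simplified index has been loaded into a plain dictionary. The entries and the name `find_lemma` are illustrative assumptions, not lines or functions taken from the Morphy data.

```python
# Toy stand-in for the simplified mapping file: each key is an inflected
# form and each value is a less inflected form. These entries are
# illustrative assumptions, not actual lines from the Morphy data.
MAPPING = {
    "gezählten": "gezählt",
    "gezählt": "zählen",
    "Enten": "Ente",
}

def find_lemma(form, mapping, max_steps=10):
    """Follow the mapping until no less inflected form is found."""
    seen = {form}
    for _ in range(max_steps):
        base = mapping.get(form)
        if base is None or base in seen:  # reached a base form, or a cycle
            return form
        seen.add(base)
        form = base
    return form

print(find_lemma("gezählten", MAPPING))  # zählen
print(find_lemma("Enten", MAPPING))      # Ente
print(find_lemma("Haus", MAPPING))       # Haus (not in the index, returned as-is)
```

The `seen` set and the `max_steps` cap are defensive choices for this sketch, so that an accidental loop in the data cannot hang the lookup.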

Mannheim Frequency Bands

What I call the ‘Mannheim list’ comes from the DeReWo database, “kindly brought to you by the Institut für Deutsche Sprache”. It gives a ‘frequency band’ value for each lemma. The band works like the exponent on a logarithmic scale, which I think makes more sense than a precise ordinal rank. Since the true frequency of the 3999th word may not be qualitatively different from that of the 4001st word, they might as well share the same ‘band’ value.
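To make the ‘band’ idea concrete, here is a small sketch. It assumes one common definition of a frequency class: a word is in band N when the most frequent word occurs roughly 2^N times as often. Whether the Mannheim list uses exactly this rounding is an assumption, and the counts are made up.

```python
import math

def frequency_band(count, top_count):
    """Band N means the top word is roughly 2**N times more frequent."""
    return math.floor(math.log2(top_count / count) + 0.5)

TOP = 1_000_000  # hypothetical count of the most frequent word

# Two neighbouring words with almost identical (made-up) counts
# fall into the same band, even though their ranks differ.
print(frequency_band(1_000, TOP) == frequency_band(1_001, TOP))  # True
```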

Combining the two sources gives a mapping table that returns both the ‘base form’ and its frequency information.
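As a rough sketch of that combination, assume both sources are already in memory as dictionaries. The names `LEMMA_INDEX`, `BANDS` and `build_table` are hypothetical, and the band values are invented for the example.

```python
# Hypothetical miniature versions of the two sources.
LEMMA_INDEX = {"Enten": "Ente", "zählt": "zählen"}  # inflected -> base form
BANDS = {"Ente": 13, "zählen": 9}                   # lemma -> band (made-up values)

def build_table(lemma_index, bands):
    """Map every known form to (base form, band); band is None if unlisted."""
    # Lemmas map to themselves, so base forms can be looked up directly.
    table = {lemma: (lemma, band) for lemma, band in bands.items()}
    for form, lemma in lemma_index.items():
        table[form] = (lemma, bands.get(lemma))
    return table

TABLE = build_table(LEMMA_INDEX, BANDS)
print(TABLE["Enten"])   # ('Ente', 13)
print(TABLE["zählen"])  # ('zählen', 9)
```

For brevity this assumes the index already points straight at the base form; with a recursive index, each form would first be resolved to its final lemma.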

Stop words for German

‘Stop words’ is a domain-specific term in computational linguistics for words that are “filtered out”. The correlate in lexicography would be ‘function words’, ‘grammatical words’ or ‘closed-class words’.

The meaning of such a word is extremely abstract, elusive, or contextual. For example, ‘apple’ is a type of fruit, but what is ‘the’ or ‘than’? Therefore we treat these words differently from the rest of the lexicon, the ‘open classes’ or ‘lexical items’, whose nature is more encyclopaedic.

What counts as a stop word varies drastically depending on the application. If you are generating a ‘tag cloud’ that shows only thematic ‘keywords’, then you might want to filter out words like ‘especially’, but a language learner might actually want to find out the meaning of such words.

I will use a list from Snowball that keeps quite strictly to the closed classes.
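A sketch of loading such a list, assuming the file follows the plain-text layout used by Snowball's stop word files, where `|` starts a comment. The sample text below is a hand-written stand-in, not the real file, and `parse_stop_words` is a hypothetical helper.

```python
# A few lines in the style of Snowball's stop word files, where "|" starts
# a comment. The sample is an assumption for illustration, not the real file.
SAMPLE = """\
 | a short illustrative excerpt
aber           |  but
der            |  the
als            |  than, as
"""

def parse_stop_words(text):
    """Collect the words before any '|' comment, ignoring blank lines."""
    words = set()
    for line in text.splitlines():
        content = line.split("|", 1)[0]
        words.update(content.split())
    return words

STOP_WORDS = parse_stop_words(SAMPLE)
print(sorted(STOP_WORDS))         # ['aber', 'als', 'der']
print("besonders" in STOP_WORDS)  # False
```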

Token Characteristics

This step checks whether a token is lexical and, if it is, saves its ‘band’ value for frequency highlighting. Essentially, whether a string is lexical depends on whether it can be found in a given dictionary.

The decision flow is as follows:

  1. Does the token consist of only alphabetic characters?
  2. Is the token a stop word?
  3. Can the token be found in the Mannheim database?
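The three checks above can be sketched as a single function. The name `classify`, the returned labels, and the lower-casing before the stop word check are assumptions made for this example.

```python
def classify(token, stop_words, bands):
    """Apply the three checks in order and report where the token lands."""
    if not token.isalpha():          # 1. punctuation, numbers, mixed strings
        return "non-alphabetic"
    if token.lower() in stop_words:  # 2. closed-class words (case-insensitive here)
        return "stop word"
    if token in bands:               # 3. known entry with a band value
        return "lexical"
    return "unknown"

# Made-up data for the demonstration.
stop_words = {"der", "als"}
bands = {"Ente": 13}

for token in ["Der", "Ente", "3", "Xyz"]:
    print(token, "->", classify(token, stop_words, bands))
```

Python's `str.isalpha()` accepts Unicode letters, so tokens containing umlauts or ‘ß’ pass the first check.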

If an entry from Mannheim is found, then the ‘band’ value of that entry is saved under that token’s embedded structure. Eventually, a lexical token will carry several extra fields that are relevant to the learner.
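One possible shape for such a token record is sketched below. The field names (`text`, `lexical`, `lemma`, `band`) and the helper `annotate` are hypothetical, as is the toy lookup table.

```python
# Toy lookup table: form -> (base form, band). Values are made up.
TOKEN_TABLE = {"Enten": ("Ente", 13), "Ente": ("Ente", 13)}

def annotate(token, table):
    """Return a token record, adding learner-facing fields when it is lexical."""
    record = {"text": token, "lexical": False}
    entry = table.get(token)
    if entry is not None:
        lemma, band = entry
        record.update({"lexical": True, "lemma": lemma, "band": band})
    return record

print(annotate("Enten", TOKEN_TABLE))
print(annotate("!", TOKEN_TABLE))
```

Keeping non-lexical tokens in the same structure, just without the extra fields, means the original text can still be reassembled in order for highlighting.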

Figure: frequency highlighting.
