How to Guess the Gender in German [1]

An Interplay between Morphemes and Ontology

David Rosson
Linguistic Curiosities
9 min read · Oct 1, 2019

Academics in social sciences are often bound by the relativist axiom “all languages are exactly equally complex.” A few minutes into a more fact-contingent context like second language acquisition or lexicography, we find out that some languages are more equally complex than others.

Grammatical Complexity

Many questions must be resolved before you can utter a correct sentence in German. Take this example in English:

“I ponder the meaning of life.”

If you think all you need is lexical translations for “ponder” and “meaning of life” to convert the sentence into German, you’re in deep trouble.

Let’s pause and marvel at the German equivalent:

Ich denke über den Sinn des Lebens nach.

  • “Life” is a nominalised verb from “to live”, therefore it’s neuter.
  • The genitive of “life” goes with the definite article “des” because it’s singular (number), genitive (case), and neuter (gender).
  • For neuter genitive specifically, the noun is also declined with “-s”.
  • The main verb “nach|denken” is separable: the particle “nach” can often teleport itself all the way to the end of the sentence.
  • The verb is naturally conjugated according to one of six persons… the informal second person plural can be tricky, or the informal singular, especially with irregular / strong verbs, but here we’re let off easy with first person singular in the present tense.
  • Having a prepositional particle “nach” included in the verb somehow isn’t enough: the verb must also go with an extra preposition, “über”. This kind of verb-preposition pairing has to be memorised.
  • Then, “über” in this pairing requires its object to be in the accusative (as a two-way preposition, “über” can take the dative in other contexts).
  • To use the accusative correctly, the gender of the accused object must also be known. “Sinn” is masculine, nominatively “der Sinn”, therefore the accusative is “den Sinn”.
  • If you wish to qualify the nouns with adjectives, e.g. “the elusive meaning of the unconsidered life”, the adjectives must then be declined according to a combinatorial matrix of (3 genders + plural) times (4 cases) times (strong vs. weak vs. mixed declensions) depending on the article being null, indefinite, or definite.
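
The article choices in the bullets above boil down to a lookup on case and gender. Here is a minimal Python sketch of the definite-article table; the dictionary layout and the helper name `definite_article` are my own illustrative assumptions, not taken from any library.

```python
# Definite articles indexed by case, then by gender/number.
# "m", "f", "n" are the three genders; "pl" is the plural.
DEFINITE_ARTICLES = {
    "nominative": {"m": "der", "f": "die", "n": "das", "pl": "die"},
    "accusative": {"m": "den", "f": "die", "n": "das", "pl": "die"},
    "dative":     {"m": "dem", "f": "der", "n": "dem", "pl": "den"},
    "genitive":   {"m": "des", "f": "der", "n": "des", "pl": "der"},
}


def definite_article(case: str, gender: str) -> str:
    """Return the definite article for a case and a gender (or "pl")."""
    return DEFINITE_ARTICLES[case][gender]


# "über" takes the accusative here and "Sinn" is masculine;
# "Leben" is neuter and stands in the genitive.
phrase = " ".join([
    "über",
    definite_article("accusative", "m"), "Sinn",
    definite_article("genitive", "n"), "Lebens",
])
print(phrase)  # über den Sinn des Lebens

# Size of the adjective-declension matrix mentioned above:
# (3 genders + plural) x 4 cases x 3 declension types.
matrix_size = (3 + 1) * 4 * 3
print(matrix_size)  # 48
```

The table makes the point of the list concrete: you cannot pick a cell without knowing the gender first.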

There may be some internally coherent logic to the claim that German and English are “equally complex” as natural languages, but if you insist that they require an equal amount of effort to learn, you are probably just deluding yourself.

The Logic and Lack of Logic in Noun Genders

Personally, each time I’m about to have a short German conversation, out of necessity, I look up all the nouns I plan to use, and try to get the Genus right.

At some point, I would like to remember the Genus for most of the common nouns. Once you go beyond a few thousand words down the frequency list, you see a lot of compound nouns, rather than entirely new nouns. This is good news, because it turns out, the word “Einzelzimmer” has the same gender as “das Zimmer”, so do “Wohnzimmer”, “Schlafzimmer”, “Konferenzzimmer” and so on. Remember one and you get the rest for free.
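
As a rough sketch of the “remember one, get the rest for free” idea, the lookup below guesses the article of a compound from its right-most known component. The tiny lexicon and the function name `gender_from_head` are hypothetical, and plain suffix matching is an assumption that can be fooled by coincidental substrings.

```python
from typing import Dict, Optional

# Tiny hand-made lexicon of base nouns; real data would come from a dictionary.
BASE_GENDERS = {"Zimmer": "das", "Mannschaft": "die", "Tag": "der"}


def gender_from_head(noun: str, lexicon: Dict[str, str]) -> Optional[str]:
    """Guess the article of a compound from its right-most known component.

    The longest matching base noun is tried first. Returns None when no
    base noun in the lexicon matches the end of the word.
    """
    lowered = noun.lower()
    for base in sorted(lexicon, key=len, reverse=True):
        if lowered.endswith(base.lower()):
            return lexicon[base]
    return None


for word in ["Einzelzimmer", "Wohnzimmer", "Schlafzimmer", "Konferenzzimmer"]:
    print(word, gender_from_head(word, BASE_GENDERS))  # each prints "das"
```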

Growth: the German lexicon is as much about having words as making up words.

Can we reduce the “surface area” of learning (or memorisation) even further? Are there patterns in German noun genders? Can noun genders be predicted based on rules? Let’s start with a few theories about how genders are attributed to nouns:

Arbitrary assignment theory: German speakers just made them up in the past, then conventions converged. This is a pseudo-theory since it does not explain how they were made up, nor does it offer any hope beyond “you just have to remember them”. It also appears quite obviously wrong, considering the fact that genders are not randomly distributed amongst words.

Ontological theory: nouns refer to things (referents) in the world, and the gender of a noun says something about the nature of the thing it refers to. This theory is rather intuitive: “der Mann” is masculine because the “man” is masculine, and so are active agents ending in “-er/-or/-eur”. A self-contained, generic instance of a class of some kind of “thing”, by contrast, is neuter like “das Ding”, for example “das Objekt”, “das Spielzeug” and so on.

The idea is that: the noun refers to a kind of thing, and that thing has some “ontological class” or category, and that leads to a certain category as Genus.

Named time units such as days (der Tag, der -tag), months (der Monat, der Oktober), and seasons (der Sommer), but not “week” and “year”; cardinal points (der Norden); weather conditions (der Schnee, der Wind, der Regen, der Nebel); liquor (der Wein, der Whiskey), but not “beer”; stones 💎 (also der Sand); currency (der Dollar, der Pfennig, der Yen; but das Pfund, die Mark); base forms of professions (der Soldat, der Anwalt); volition (der Zweck, der Wunsch, der Wille); movement (der Flug, der Umzug, der Schlag); intentional assembly (der Bau, der Platz, der Rat, der Markt); rivers (specifically, “outside Germany” 🤷‍♂️); and most lakes and mountains are masculine.

Rivers (“in Germany” 🤷‍♀️), fields of study (die Mathematik, die Phonologie, die Literatur), cardinal numbers (ordinal numbers are adjectives), unnamed time units and collectives (die Stunde, die Sekunde, die Woche, die Zukunft), qualities and attributes (die Qualität, die Frequenz), settings (die Szene), conditions, environments (die Basis, die Gefahr, die Region, die Saison), abstract forces (die Macht, die Wehr, die Pflicht), collectives (die Familie, die Kirche, die Firma), ambient collectives (die Welt, die Mode, die Kunst, die Kultur, die Wirtschaft) are feminine.

Materials (das Papier, das Wasser, das Metall), minerals (das Salz), metals (das Gold), means in relation to man (das Feld, das Mittel, das Geld), colours, places (das Land), premises (das Gelände, das Gebiet, das Zimmer), offspring (das Kind, das Baby), diminutives (das Mädel), artefacts (das Stück, das Dokument, das Bild, das Porträt, das Ticket, das Paket), man-made inventions (das Gerät, das Flugzeug, das Werk, das Möbel) and non-material concepts (das Konzept, das Schema, das Ziel, das Verbot, das Institut, das Spiel), and instances of a collective (das Mitglied, das Datum), are neuter.

These ontological properties transfer over to abbreviated forms:

der Akku ← der Akkumulator
die Uni ← die Universität

Also transferring from loan words:

der Burrito 🇩🇪 el burrito 🇲🇽
die Pizza 🇩🇪 la pizza 🇮🇹
das Sushi 🇩🇪 寿司 🇯🇵

Even laterally from German words to loan words:

der Pyjama ← der Anzug
die E-Mail ← die Nachricht

Now, my favourite:

Morphological theory: have you noticed, many names in the periodic table rhyme? Morphology parallels ontology! Namely, the “shape” or appearance of the words, in part, reflects properties of the thing that the word refers to. In all reasonable universes, the gender of ‘Olympiamannschaft’ should be predictable from the gender of ‘Mannschaft’. Not only that, we may soon discover that it’s in fact predictable based on ‘-schaft’.
(But not “der Schaft”, turns out that just happens to be something else…)

“Endings” (the word-final n-grams) are the most predictive (approaching “exception-free”) when they are the result of compounding (rather than coincidentally matching substrings). For example, ‘wineglass’ is really just a type of ‘glass’ for ‘wine’, but ‘carpet’ is not a type of ‘pet’ in a ‘car’.

Also with prefixes:

der Bedarf, der Befehl, der Begriff, der Besitz…
das Gefühl, das Gemüse, das Gepäck, das Gerät,
das Gesetz, das Geschenk, das Gespräch
But — der Gepard, and die Gefahr, die Gewalt

And with nominalisation:

For example, the “gerund form” of an infinitive is neuter:

das Schreiben, das Sein

Or “reverse nominalisation”, where the masculine noun is the stem:

verb: versuchen — noun: der Versuch
verb: schlagen — noun: der Schlag

Note: useful for raising computer monitors

“Der, die, oder das Nutella?”

We’ll know in a minute.

A Computational Approach

Back to the story: I would like to remember the gender for thousands of nouns. To greatly reduce the number of items to memorise, I should look for compounds, because the gender of a compound is entirely predictable from the base form (on the right-hand side of the orthographic string).

Next, we can go one step further, and use “endings” (also groups of letters from the right-hand side of the word) to guess the gender. Some endings have strong predictive powers, e.g. ‘-ung’, ‘-tion’, ‘-ment’. Others like ‘-um’, ‘-ma’ at least help to a degree. So here’s the idea, why not run the numbers to see how predictive each of these endings can be?

But why stop there? If we make the bold assumption:

Compounds ≈ Endings ≈ N-Grams

Let’s just look at n-grams from the right-hand side, and catch ‘em all!

Method

The n-grams include compound stems, endings, and everything up to whole words, which skips the hassle of having separate methods for discovering compounds, comparing endings and partial endings, and so on.
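
Generating these right-hand n-grams takes only a few lines. The sketch below is one way to do it; the function name `right_ngrams` and the choice to lowercase the word are my assumptions.

```python
from typing import List


def right_ngrams(word: str, min_len: int = 1) -> List[str]:
    """Return every word-final substring, shortest first, up to the whole word."""
    lowered = word.lower()
    return [lowered[-n:] for n in range(min_len, len(lowered) + 1)]


print(right_ngrams("Schaft"))
# ['t', 'ft', 'aft', 'haft', 'chaft', 'schaft']
```

A word of length k yields k candidates, so ‘Mannschaft’ contributes ten “endings”, including ‘-schaft’ and the whole word itself.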

If in 99%+ of cases, ‘-schaft’ marks a feminine noun, then we have found a high-confidence ending. Then it moves along the slope from “stems” to “rules” to “lores”, with corresponding probabilities for ‘-aft’, ‘-ft’, or just ‘-t’.
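
The slope from “stems” to “rules” to “lores” can be sketched as a back-off: tally the genders under every ending, then, for a new word, walk from the longest ending to the shortest until one is confident enough. The toy lexicon, the thresholds, and the function names below are illustrative assumptions, not the script behind the results.

```python
from collections import Counter, defaultdict
from typing import Dict

# Toy lexicon for illustration only; a real run would use dictionary data.
LEXICON = {
    "Mannschaft": "f", "Wirtschaft": "f", "Gesellschaft": "f",
    "Kraft": "f", "Luft": "f", "Saft": "m", "Duft": "m",
    "Welt": "f", "Rat": "m", "Gerät": "n",
}


def tally_endings(lexicon: Dict[str, str]) -> Dict[str, Counter]:
    """Count genders for every word-final n-gram in the lexicon."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, gender in lexicon.items():
        lowered = word.lower()
        for n in range(1, len(lowered) + 1):
            counts[lowered[-n:]][gender] += 1
    return counts


def guess_gender(word, counts, min_confidence=0.95, min_count=2):
    """Back off from the longest ending to the shortest.

    Returns (gender, ending, confidence) for the first ending that has at
    least `min_count` words and a majority share of `min_confidence`,
    or None if no ending qualifies.
    """
    lowered = word.lower()
    for n in range(len(lowered), 0, -1):
        tally = counts.get(lowered[-n:])
        if not tally:
            continue
        total = sum(tally.values())
        gender, hits = tally.most_common(1)[0]
        if total >= min_count and hits / total >= min_confidence:
            return gender, lowered[-n:], hits / total
    return None


counts = tally_endings(LEXICON)
print(guess_gender("Freundschaft", counts))  # ('f', 'schaft', 1.0)
print(guess_gender("Heft", counts))          # None: '-ft' and '-t' are too mixed
```

In this toy data ‘-schaft’ is unanimous, ‘-aft’ is feminine in four of five words, and ‘-t’ drops to six of ten, which is the slope in miniature.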

Then, gather the most predictive endings (with a high confidence of .95 or .99) and the most prolific ones (with a large number of instances, in terms of both word types and tokens, i.e. weighted by frequency), and put them into a card deck. Add fancy colour-coding for confidence levels; add a swiping interface; add repetition tracking for probabilistic attainment; later, add sleek thematic tags, e.g. “season”, “weather”, “scientific unit”…

We could even use parsed corpora to calculate the frequency of cases in usage for each noun — and since some agreement rules (declensions, articles) are collapsed between gender categories, e.g. masculine and neuter both have ‘ein’ in the nominative, and both ‘-em’ in the dative — we can calculate the “cost” of errors (where a misremembered gender would “show” and how frequently), and prioritise which items to remember based on that…
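
To make the “cost of errors” idea concrete, the sketch below lists the cases in which a misremembered gender would produce a different article, then sums a weight per case. It looks at singular articles only (not noun endings or adjectives), and the uniform case weights are a placeholder assumption, not corpus figures.

```python
# Singular article forms by type, gender, and case.
ARTICLES = {
    "indefinite": {
        "m": {"nom": "ein", "acc": "einen", "dat": "einem", "gen": "eines"},
        "n": {"nom": "ein", "acc": "ein", "dat": "einem", "gen": "eines"},
        "f": {"nom": "eine", "acc": "eine", "dat": "einer", "gen": "einer"},
    },
    "definite": {
        "m": {"nom": "der", "acc": "den", "dat": "dem", "gen": "des"},
        "n": {"nom": "das", "acc": "das", "dat": "dem", "gen": "des"},
        "f": {"nom": "die", "acc": "die", "dat": "der", "gen": "der"},
    },
}


def visible_cases(article_type, true_gender, guessed_gender):
    """List the cases in which the wrong gender yields a different article."""
    true_forms = ARTICLES[article_type][true_gender]
    guessed_forms = ARTICLES[article_type][guessed_gender]
    return [case for case in true_forms if true_forms[case] != guessed_forms[case]]


def error_cost(article_type, true_gender, guessed_gender, case_weights):
    """Sum the usage weights of the cases where the mistake would show."""
    shown = visible_cases(article_type, true_gender, guessed_gender)
    return sum(case_weights[case] for case in shown)


# Placeholder weights: every case equally likely. Real weights would be
# per-noun case frequencies taken from a parsed corpus.
UNIFORM = {"nom": 0.25, "acc": 0.25, "dat": 0.25, "gen": 0.25}

print(visible_cases("indefinite", "m", "n"))        # ['acc']
print(visible_cases("definite", "m", "n"))          # ['nom', 'acc']
print(error_cost("indefinite", "m", "n", UNIFORM))  # 0.25
```

Mixing up masculine and neuter with an indefinite article only shows in the accusative, whereas confusing masculine with feminine shows in all four cases.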

For now, perhaps just a simple script.

Morphology Data

To get the gender information, one possibility is to extract it from the dataset of a morphological parser, for example, Morphy:

Lezius, W., Rapp, R., & Wettler, M. (1998). A Freely Available Morphological Analyzer, Disambiguator and Context Sensitive Lemmatizer for German. In Proceedings of the COLING-ACL 1998, Volume 2 (pp. 743–748).

But I might just use some dictionary data instead…

Frequency Data

The frequency information comes from the “Mannheim” list (a.k.a. DeReWo).

Perkuhn, R., Belica, C., Kupietz, M., Keibel, H., & Hennig, S. (2009). DeReWo: Korpusbasierte Wortformenliste. Institut für Deutsche Sprache, Mannheim.

Each lemma is given a “frequency band” (rather than raw count, for reasons that are another interesting discussion) ranging from 0 to 29 where a smaller number means higher frequency.

For this experiment I will use both this “frequency band” and the “rank” (the word’s position in that list) to weigh frequency.

Results

  • The results include 105,698 “endings” (including whole words), covering about the 40K most common words, of which 23K are nouns.
  • The rows are sorted by “n-gram weight”, which is a number that reflects how many children an ending has, and the frequency ranking of those children. A higher value means more words end with this ending, and those words are more common.
  • There are three columns for “unweighted distribution”, which is based on each child instance having “1 vote”, e.g. if an ending has 4 words that contain the ending, of which one is masc., one is fem., and two are neuter, then the unweighted distribution is 0.25 : 0.25 : 0.50.
  • The frequency-weighted columns are calculated by weighting each instance by the inverse of the instance’s frequency band, so having a few very common words gets more votes than having lots of uncommon words.
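
The two kinds of distribution can be sketched as follows. The exact weighting formula is not spelled out above, so the `1 / (band + 1)` vote is an assumption (the “+ 1” avoids dividing by zero for band 0), and the frequency bands in the example are invented for illustration.

```python
from collections import defaultdict

GENDERS = ("m", "f", "n")


def distribution(children, weighted=False):
    """Return the (m, f, n) shares among the words that share an ending.

    `children` is a list of (gender, frequency_band) pairs. Unweighted,
    each word has one vote; weighted, each word votes with 1 / (band + 1),
    an assumed stand-in for "the inverse of the frequency band".
    """
    votes = defaultdict(float)
    for gender, band in children:
        votes[gender] += 1.0 / (band + 1) if weighted else 1.0
    total = sum(votes.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(round(votes[g] / total, 2) for g in GENDERS)


# The example from the list above: four words, one masculine, one feminine,
# two neuter. The bands (9, 4, 19, 19) are made up.
children = [("m", 9), ("f", 4), ("n", 19), ("n", 19)]
print(distribution(children))                 # (0.25, 0.25, 0.5)
print(distribution(children, weighted=True))  # (0.25, 0.5, 0.25)
```

With these made-up bands, the single common feminine word outvotes the two rare neuter words once frequency is taken into account.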

The Full CSV file on GitHub.

Details and analyses of the results:

Part 2: Top Noun Endings
