How to Guess the Gender in German [2]

The Most Predictive Noun Endings

David Rosson
Linguistic Curiosities
8 min read · Dec 22, 2019

Endings

Following the spirit of Zipf and Pareto, the top 100 endings cover a whopping 16,775 unique words (in true-positive guesses), with a median certainty of 97%. The next 100 endings cover 1,570 words.

Unigrams

How much can a single letter tell us? In most cases, not a lot…

The most applicable one is perhaps ‘-e’, a single letter that predicts a 0.88 chance for the noun to be feminine — with many, many exceptions.
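The tallies behind figures like this can be sketched roughly as follows. This is a minimal illustration and not the script used for the article: the function name `ending_stats` and the toy word list are assumptions made up for the example, and the real data set is far larger.

```python
from collections import Counter, defaultdict

def ending_stats(nouns, n):
    """Tally genders for every ending of length n.

    Returns {ending: (majority_gender, certainty, word_count)}.
    """
    tallies = defaultdict(Counter)
    for word, gender in nouns:
        if len(word) >= n:
            tallies[word[-n:].lower()][gender] += 1
    stats = {}
    for ending, counts in tallies.items():
        total = sum(counts.values())
        gender, hits = counts.most_common(1)[0]
        stats[ending] = (gender, hits / total, total)
    return stats

# Hypothetical toy sample, only to show the shape of the input.
SAMPLE = [
    ("Lampe", "f"), ("Blume", "f"), ("Katze", "f"), ("Name", "m"),
    ("Haus", "n"), ("Tisch", "m"),
]

unigram = ending_stats(SAMPLE, 1)
print(unigram["e"])  # ('f', 0.75, 4)
```

Calling the same function with `n=2` or `n=3` would give the bigram and trigram tables; the certainty for an ending is simply the share of its words that carry the majority gender.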

Bigrams

Probability, gender, bigram, example words:

Note! Old results, containing diphthong splitting artefacts (e.g. -um stats are confounded by -aum)

Reverse Reasoning

Now, to answer the Nutella question… We can see from the ending ‘-la’ that a word of such shape is more likely to be feminine, barring compelling overrides (for example, the substance that we commonly refer to as “Nutella” is a type of material). It should also be noted that ‘-la’ is clearly not a native element…

"la": {
  "f": [
    "Villa",
    "Aula",
    "Tombola",
    "Gala",
    "Skala",
    "Telenovela"
  ],
  "m": [
    "Gorilla", // (primate)
    "Tequila" // (liquor)
  ]
}
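Reasoning backwards from an ending to a guess can be sketched as a longest-suffix lookup. The snippet below is a hedged illustration: `guess` and `EXAMPLES` are hypothetical names, and the table holds only the eight example words listed above, so the resulting share is not the article's real statistic for ‘-la’.

```python
# Only the example words shown above, not the full data set.
EXAMPLES = {
    "la": {
        "f": ["Villa", "Aula", "Tombola", "Gala", "Skala", "Telenovela"],
        "m": ["Gorilla", "Tequila"],
    }
}

def guess(word, table):
    """Return (gender, share) for the longest known ending of `word`, or None."""
    w = word.lower()
    for n in range(len(w), 0, -1):  # try the longest suffix first
        entry = table.get(w[-n:])
        if entry:
            counts = {gender: len(words) for gender, words in entry.items()}
            best = max(counts, key=counts.get)
            return best, counts[best] / sum(counts.values())
    return None

print(guess("Nutella", EXAMPLES))  # ('f', 0.75)
```

A purely shape-based guess like this knows nothing about ontological overrides, which is exactly why words such as ‘Gorilla’ end up as exceptions.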

Ontological transfer:
‘der Affe’ 🐒 → ‘der Gorilla’ 🦍
‘das Pferd’ 🐎 → ‘das Zebra’ 🦓 (though ‘-ra’ is 82% feminine)

Top (True) Base Words

These 41 words have 768 children. That’s a 19x ROI!

A “false base word” is like ‘-ei’, as an ending it’s 94% feminine, and has 140 children — it’s also a word on its own, but its children may or may not be its compounds — ‘das Rührei’ is a type of ‘Ei’, while ‘die Polizei’ is not.

There’s more!

Feminine base words, 505 children:

Nahme, Fahrt, Bahn, Bank, Welt, Gemeinde, Steuer, Frau,
Kammer, Pause, Liga, Firma, Rede, Phase, Nummer, Folge,
Wehr, Gefahr, Sorge, Insel, Freude, Sicht, Angst, Liebe,
Idee, Armee, Gewalt, Tochter, Box, Burg

Masculine base words, 1087 children:

Tag, Bau, Hof, Dienst, Chef, Verband, Verein, Verkehr, Abend, 
Wert, Film, Konzern, Staat, Stein, Wechsel, Turm, Park, Arzt,
Rat, Freund, Stoß, Verlust, Gast, Name, Schein, Erfolg, Titel,
Kollege, Bezirk, Kandidat, Anwalt, Brief, Start, Stil, Wirt,
Vorteil, Soldat, Bischof, Tod, Spaß

Neuter base words, 701 children:

Haus, Werk, Amt, Mittel, Jahr, Gericht, Zeug, Mitglied, 
Theater, Konzert, Wasser, Schiff, Gebäude, Team, Heim,
Paar, Produkt, Institut, Büro, Land, Dorf, Finale, Turnier,
Gelände, Fon, Hotel, Interesse, Volk, Schloss, Ende, Auge

Caveats

There are a lot of artefacts in the results, which fall into several categories:

  • Where the group of words are all compounds of the same base word, rather than different base words of the same ending.
  • Where a shorter ending would be sufficient, but it gets split into superstrings that give the same results… for example, ‘-gion’, ‘-lion’, ‘-nion’, ‘-sion’, ‘-tion’ are all feminine and should just be collapsed into ‘-ion’.
  • Where a single phoneme is spelled with multiple letters, e.g. ‘sch’, the onset of the syllable should stay as an intact unit…
  • But then, ‘Bischof’ is a useful mnemonic to remember the gender of ‘Hof’ even though it’s unrelated
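The second kind of artefact, redundant superstrings, could be pruned with a pass like the one below. It is only a sketch under assumptions: `collapse`, the 0.95 threshold, and the ‘-ion’ figures are invented for illustration, while the ‘-um’ and ‘-aum’ values are the ones quoted in this article.

```python
def collapse(stats, threshold=0.95):
    """Drop endings whose shorter suffix already predicts the same gender.

    `stats` maps ending -> (gender, certainty). An ending is kept only if no
    shorter kept ending is a suffix of it with the same gender and a
    certainty at or above `threshold`.
    """
    kept = {}
    for ending in sorted(stats, key=len):  # shortest endings first
        gender, certainty = stats[ending]
        redundant = any(
            ending.endswith(short)
            and kept[short][0] == gender
            and kept[short][1] >= threshold
            for short in kept
        )
        if not redundant:
            kept[ending] = (gender, certainty)
    return kept

STATS = {
    "ion": ("f", 0.99),   # hypothetical value
    "tion": ("f", 1.0),   # hypothetical value
    "sion": ("f", 1.0),   # hypothetical value
    "um": ("n", 0.72),    # figure quoted in the article
    "aum": ("m", 1.0),    # figure quoted in the article
}

print(sorted(collapse(STATS)))  # ['aum', 'ion', 'um']
```

Note that ‘-aum’ survives because it disagrees with its parent ‘-um’, whereas ‘-tion’ and ‘-sion’ add nothing beyond ‘-ion’ and are dropped.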

More on phoneme units

  • This orthography-based n-gram approach has some limitations: for example, ‘-eund’ includes ‘-und’, and ‘-aum’ includes ‘-um’ as a substring. What we really need is a method that treats the diphthong as a single unit, so that we can search for ‘-um’ words that are true ‘-um’ words, and not ‘-aum’ confounds.
  • For example, ‘-um’ is often cited as an ending for neuter nouns, but, here it only shows a predictive certainty of 0.72 — what’s going on?
    It turns out, ‘-um’ gets “overridden” by superstring endings such as ‘-aum’, which indicates 100% masculine: ‘der Raum’, ‘der Baum’, ‘der Traum’…
  • EDIT: I’ve updated the script to fix the phoneme-splitting issue, replacing non-splittable sequences ‘ch’, ‘ck’, ‘pf’, ‘qu’, ‘ai’, ‘ei’, ‘au’, ‘eu’, ‘äu’, and ‘ie’ with single characters for tallying. Splittable sequences are not swapped, for example, ‘th’ occurs in “Hauptthema” (non-splittable), but also across morpheme boundaries in “Aufenthalt” (splittable). Without making the rules more complex, ‘sch’ is also problematic in words like ‘Häuschen’…
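The fix described in the EDIT can be sketched like this. It is an approximation of the idea, not the actual script: the helper names and the use of private-use code points as stand-in characters are assumptions for this example.

```python
import re

# The non-splittable sequences listed above.
UNITS = ["ch", "ck", "pf", "qu", "ai", "ei", "au", "eu", "äu", "ie"]

# Map each unit to one private-use code point (an arbitrary choice here).
ENCODE = {unit: chr(0xE000 + i) for i, unit in enumerate(UNITS)}
DECODE = {char: unit for unit, char in ENCODE.items()}
PATTERN = re.compile("|".join(UNITS))

def encode(word):
    """Replace every non-splittable sequence with a single character."""
    return PATTERN.sub(lambda m: ENCODE[m.group(0)], word.lower())

def decode(text):
    """Turn the stand-in characters back into their letter sequences."""
    return "".join(DECODE.get(char, char) for char in text)

def ending(word, n):
    """Ending of `word` that is n units long, with digraphs counted as one."""
    return decode(encode(word)[-n:])

print(ending("Baum", 2))   # aum
print(ending("Datum", 2))  # um
print(ending("Glück", 2))  # ück
```

With this encoding, ‘Baum’ no longer contributes to the ‘-um’ tally. A plain replacement like this is still purely orthographic: it would, for instance, treat the ‘e’ and ‘u’ in ‘Museum’ as a diphthong even though they belong to different syllables.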

Top feminine endings

Ending, probability, gender, example words, exceptions
‘V:’ indicates a long vowel

Top masculine endings

Top neuter endings

Most exceptions have an easy explanation

  • Many endings are almost 100% — the reason it’s not 100% is that there are a handful of words (often just one or two) that are exceptions in gender.
  • In “Ara” and “Zebra”, the gender is explained by ontological derivation from animals; the same goes for “Matrose” (person) and many demonyms with otherwise entirely feminine endings, or “Dunst” (weather).
  • There are also loanwords that get an ontologically transferred gender, e.g. “der Cyberspace” (from “der Raum”) and “der Breakdance” (from “der Tanz”), even though ‘-ce’ is generally a feminine ending.
  • The substring ‘sis’ in “das Chassis” is not the same etymological morpheme as ‘-sis’ in Greek words; the last ‘s’ is even silent…
  • Prefix overrides also occur in “der Verdacht” and “das Gemenge”.

Families of endings

As a natural rule, longer endings have a narrower set of children, therefore their predictive power goes up. For example, we know that ‘-in’ is a personal ending that is feminine. Should most nouns in ‘-in’ be feminine then? It’s about 90%, because the stats are diluted by high-frequency exceptions like ‘der Termin’. But if we look at the child-endings (superstrings) of the ‘-in’ ending, e.g. ‘-gin’, ‘-rin’, ‘-tin’, the predictive power becomes very certain.

Thus, one of the phenomena to notice is that the predictive endings come in “families”: the “head” (the shortest) ending starts with a moderate certainty, and its children collectively add a lot more coverage.
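A family can be inspected by tallying a head ending together with its one-letter-longer children. The sketch below uses a hypothetical helper `family` and a five-word toy sample, so its percentages are illustrative and differ from the roughly 90% quoted for ‘-in’.

```python
from collections import Counter, defaultdict

def family(nouns, head):
    """Majority gender and certainty for `head` and each child ending."""
    tallies = defaultdict(Counter)
    for word, gender in nouns:
        w = word.lower()
        if w.endswith(head):
            tallies[head][gender] += 1
            if len(w) > len(head):
                # Child ending: the head plus the one letter before it.
                tallies[w[-(len(head) + 1):]][gender] += 1
    result = {}
    for ending, counts in tallies.items():
        gender, hits = counts.most_common(1)[0]
        result[ending] = (gender, hits / sum(counts.values()))
    return result

# Hypothetical toy sample.
SAMPLE_IN = [
    ("Lehrerin", "f"), ("Ärztin", "f"), ("Königin", "f"),
    ("Freundin", "f"), ("Termin", "m"),
]

fam = family(SAMPLE_IN, "in")
print(fam["in"])   # ('f', 0.8)
print(fam["rin"])  # ('f', 1.0)
print(fam["min"])  # ('m', 1.0)
```

The head is diluted by ‘Termin’, while each child ending in this sample is unanimous, which mirrors the pattern described above.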

The ‘-er’ family of endings with predictiveness, example words, and exceptions
The ‘-le’ and ‘-re’ families of endings
The exception in ‘-anz’ is a morpheme boundary artefact. The ‘-nis’ family is interesting though, most words are formed with a native stem (e.g. ‘Ergeb-’, ‘Kennt-’, but not ‘Pe’) plus the suffix, but they differ in gender…
The family of nominalised verbs, that is, a capitalised infinitive (ending in ‘-en’) used as a noun, which is neuter, but there are also mundane native nouns that happen to end in ‘-en’

Umlaut makes a difference

Example: ‘-uck’ : 100% masculine; ‘-ück’ : 100% neuter

Long vowel makes a difference

Words ending in ‘-er’ are 85% masculine, but words ending in ‘-er’ preceded by a long vowel (-V:er) are 100% masculine. Words ending in ‘-e’ are 88% feminine, but ‘-V:e’ words are 100% feminine.

Homograph/homophone differing by sense

We could also say it differs by whether the ‘-er’ is morphological in nature, or just an incidental substring — i.e. a ‘ladder’ is not someone that does ‘ladding’:

das Messer (knife)
der Messer (a measuring device, analyser)

Some morphological stems are in an indeterminate state… e.g. ‘-nis’ is sometimes neuter, sometimes feminine, even though both branches follow the same word-formation logic. When the ending is also a word on its own, dictionaries often note that it has multiple genders, e.g. ‘der Vorteil’ and ‘der Nachteil’ derive from ‘der Teil’ and inherit its (primary) gender, but not ‘das Gegenteil’ or ‘das Urteil’.

Synonyms differing in gender

der Bereich, die Region, das Gebiet
der Grund, der Boden, die Erde, das Feld
der Apparat, die Maschine, das Gerät

Ontological attribution is a spectrum, sometimes it’s pretty clear, sometimes it resembles tabloid astrology. You just get a bunch of friends sitting in an editorial room to brainstorm up retrospective rationalisations.

Final ‘e’ makes a difference

While ‘-ck’ is less than 0.4% feminine, ‘-cke’ is 100% feminine. If we drop the final ‘e’ in ‘die Ecke’, it becomes ‘das Eck’.

Homophonic endings differing in gender

Collective nouns ending in ‘-ei’ are feminine. There are also some non-collective words that end with the same diphthong but are masculine; these are spelled with ‘-ai’.

Prefix takeover

‘die Lust’, but ‘der Verlust’
‘die Menge’, but ‘das Gemenge’
* Then again, exceptions: ‘die Gewalt’, ‘die Gefahr’, ‘der Genuss’, ‘der Gedanke’

A (rather cryptic) hypothesis about German noun genders

For a formless concept to incarnate itself as an enunciable word, it must adopt elements from a morphological repertoire. There is an intricate interplay between the ontological constitution of the concept, and the hypo-morphological skeleton inherited from etymology, and the choices of accepted endings through which the forms manifest. Rules cascading upon rules create the illusion of many exceptions.

An efficient learning system

  • Before analysing the endings, collapse all compounds and attribute the compounds’ frequency weight to their base words.
  • Use true morphemes instead of n-grams. For example, ‘-mus’ is a morpheme, while ‘-us’ is not. The n-gram tree also gives you “Bus”, which is unrelated to all the other words with ‘-us’ via “-mus”.
  • Then it will become clear that for each true morpheme ending, there are just a handful of base words, or base words sharing the same morpheme, belonging to either just one gender, or in some cases two.
  • For example, ‘-tät’ and ‘-enz’ are morphological endings that beget many base words which beget more compounds — while the substring ‘-lg’ (along with ‘-olg’, ‘-folg’, ‘-rfolg’…) is just an artefact of ‘Erfolg’ (and no other base words) which has a number of compounds.
  • Have some intros for understanding patterns of syntactically generated nouns (e.g. adjectival nouns, nominalised verbs, etc.)
  • Prioritise by usefulness (aggregated frequency weight), certainty, and predicted ease of recall — we also need a more sophisticated concept of “certainty”, so that we not only look at endings that are 100% predictably of one gender, but also those cleanly split say 60% m. and 40% n. with only 4 base words and no exceptions, or 100% f. excluding demonyms…
  • Learn the association using words as concrete examples, instead of memorising the substrings or endings. Namely, the task is to learn the gender of real words, in batch, rather than to learn the endings. Your mind would automatically do the extrapolation to comprehend the connections.
  • Show conforming examples, as well as exceptions — when convenient, add explanations for the exceptions (e.g. demonyms, ontological overrides)
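The first step in this list, folding compounds into their base words, might look roughly like the sketch below. Everything in it is an assumption for illustration: the function name, the frequency numbers, and the naive suffix test, which stands in for a real compound splitter.

```python
def attribute_to_bases(freqs, bases):
    """Add each word's frequency to the longest base word it ends with.

    Returns (totals per base word, words that matched no base word).
    """
    totals = {base: 0 for base in bases}
    unmatched = {}
    by_length = sorted(bases, key=len, reverse=True)  # prefer longer bases
    for word, freq in freqs.items():
        w = word.lower()
        base = next((b for b in by_length if w.endswith(b.lower())), None)
        if base is None:
            unmatched[word] = freq
        else:
            totals[base] += freq
    return totals, unmatched

# Hypothetical frequency weights.
FREQS = {
    "Haus": 100, "Krankenhaus": 40, "Rathaus": 20,
    "Hof": 30, "Bahnhof": 50,
    "Bischof": 10,  # not a compound of 'Hof', but the suffix test matches it
    "Tisch": 25,
}

totals, unmatched = attribute_to_bases(FREQS, ["Haus", "Hof"])
print(totals)     # {'Haus': 160, 'Hof': 90}
print(unmatched)  # {'Tisch': 25}
```

The ‘Bischof’ entry shows why a bare suffix test is not enough: separating true compounds from look-alikes (as with ‘das Rührei’ versus ‘die Polizei’) needs morphological knowledge that this sketch does not have.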

Next Steps

  • An app (yet another?) with dynamic search type-ahead, showing examples and exceptions…
  • An ontology brainstorming game, based on synonyms (e.g. from WordNet), in the style of Apples to Apples or Cards Against Humanity, where participants come up with explanations and compete on connotations
  • This noun gender exercise is already halfway to solving the larger problem of “compacting the vocabulary” — it’s not so much about memorising (item by item) but about understanding the internal structures, then “Voila!” it’s 20x more efficient.
