Modelling Morphological Features


Languages use suffixes and prefixes to convey context, stress, intonation, and grammatical meaning (such as subject-verb agreement). Suffixes and prefixes belong to a more general class of entities, the meaningful sub-parts of a word, which are called morphemes. A language’s morphology refers to the rules and processes through which morphemes are combined; this allows a word to express its syntactic categories and semantic meaning. For example, in English, a verb can have three tenses: past, present, and future. These are the ‘inflected forms’ of the verb.

Morphology plays a central role in our understanding of language [1]. This mechanism allows a ‘word stem’ (the base form of a word, also referred to as ‘lemma’ or ‘lexeme’) to express various linguistic features such as gender, mood, tense, quantity, and case. The formation of a word therefore directly influences the patterns and grammatical meaning of sentences. For example, consider the English words elderly and rapidly. Although both words contain the same suffix, -ly, it functions differently in each. When we add -ly to the noun elder, we create an adjective. However, when we add -ly to the adjective rapid, we create an adverb that describes how fast something is done.

Languages that use morpheme combination (inflection) extensively to generate words are said to be “morphologically rich”, and different languages exhibit inflection to varying degrees. For example, in Russian a noun can have up to 10 distinct forms and imperfective verbs can have up to 30 forms, while verbs in Archi can display thousands of inflected forms [1]. In contrast, analytic languages (e.g., Mandarin) make little or no use of inflectional suffixes or prefixes.

The need for modelling Morphology


In morphologically rich languages, new words are created frequently through prefixation, suffixation, and other processes, so a speaker may well encounter an inflected form they have never seen before. Humans deal with rich morphology in two ways: (a) predicting sub-word structures, and (b) predicting combination rules for morphemes. Consider the following words in English: unbreak, uncry and re-gifter. Although these are novel words, an English speaker can guess their meaning from the general usage of the prefixes un- (which undoes an action) and re- (which repeats an action).

The English verbs play and walk show similar imperfective forms. Such systematic relations between inflected forms allow humans to make generalizations.

Rich morphology usually creates sparsity in the data: a lexical entity can appear in many inflected forms, so the likelihood of seeing any particular form is low. Despite the plethora of inflected forms for a given lemma, native and bilingual speakers can easily predict the correct variant dictated by the rules of the language. This can be attributed to the fact that inflected forms are systematically related to each other.

Can we learn the distribution of morphological features given an inflected form?

Given an inflected word, we first want to predict its morphological features. This is an instance of multi-label classification: each morphological feature is either ‘present’ or ‘absent’ in a word. In this work, we focus on a probabilistic approach, which will help us build a generative model of inflected forms. A straightforward approach is to learn the parameters of a categorical distribution over the morphological features.
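To make the multi-label framing concrete, here is a minimal sketch of how a morphological tag could be turned into a vector of ‘present’/‘absent’ labels. The feature inventory and the comma-separated tag format are illustrative assumptions, not the exact format of any dataset; in practice the inventory would be collected from the training data.

```python
# Hypothetical feature inventory (an assumption for illustration only).
FEATURES = ["pos=N", "pos=V", "num=SG", "num=PL", "tense=PST", "tense=PRS"]
FEATURE_INDEX = {name: i for i, name in enumerate(FEATURES)}


def encode_tag(tag):
    """Turn a comma-separated tag string into a multi-hot list of 0/1 labels."""
    vector = [0] * len(FEATURES)
    for feature in tag.split(","):
        vector[FEATURE_INDEX[feature.strip()]] = 1
    return vector


def decode_vector(vector):
    """Recover the names of the features marked as present."""
    return [name for name, bit in zip(FEATURES, vector) if bit == 1]


example = encode_tag("pos=V,tense=PST")
print(example)                 # [0, 1, 0, 0, 1, 0]
print(decode_vector(example))  # ['pos=V', 'tense=PST']
```

Each position of the vector is then one binary prediction target, which is what makes the problem multi-label rather than multi-class.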

Graphical model for morphological feature prediction from inflected word form

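Consider the following simple model, which takes the inflected form of a word as input and predicts its morphological tag. We denote the inflected word sequence by x and its morphological features by y = [y₁, y₂, …, yₘ], where m is the number of features. N is the number of data points and θ denotes the network parameters. We assume that each morphological feature is drawn from a Bernoulli distribution whose parameter is given by f_θ, a neural network. Assuming independence between the morphological features, the distribution over y is:

$$p_\theta(y \mid x) = \prod_{i=1}^{m} \mathrm{Bernoulli}\big(y_i \mid f_\theta(x)_i\big) = \prod_{i=1}^{m} f_\theta(x)_i^{\,y_i}\,\big(1 - f_\theta(x)_i\big)^{1 - y_i}$$

where f_θ is a multi-layered perceptron (MLP) parameterized by θ with x as the input. Let p_D denote the distribution of the training data. The model can be trained by minimizing the negative log-likelihood:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x, y) \sim p_D}\big[\log p_\theta(y \mid x)\big] \approx -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{m} \Big[ y_i^{(n)} \log f_\theta(x^{(n)})_i + \big(1 - y_i^{(n)}\big) \log\big(1 - f_\theta(x^{(n)})_i\big) \Big]$$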
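As a rough illustration of this objective, the sketch below computes the negative log-likelihood of a binary feature vector under independent Bernoulli distributions. It assumes the network outputs one real-valued score (logit) per feature and that a sigmoid turns each score into a probability; the function names are hypothetical and this is not the original training code.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bernoulli_nll(logits, targets):
    """Negative log-likelihood of 0/1 targets under independent Bernoullis.

    With p_i = sigmoid(z_i), the per-feature term
    -[y_i log p_i + (1 - y_i) log(1 - p_i)] simplifies to softplus(z_i) - y_i * z_i.
    """
    return sum(softplus(z) - y * z for z, y in zip(logits, targets))


def mean_nll(batch_logits, batch_targets):
    """Average the per-word loss over a batch of N words."""
    losses = [bernoulli_nll(z, y) for z, y in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)


# A score of 0 means probability 0.5, so each feature costs log(2).
print(bernoulli_nll([0.0, 0.0], [1, 0]))  # 2 * log(2), about 1.386
```

Minimizing this quantity over the network parameters is the same as maximizing the likelihood of the observed feature vectors.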

Data, Architecture and Experiments

We use the dataset provided for the SIGMORPHON 2016 [2] shared task on morphological reinflection, which covers 10 languages. The dataset includes languages that are considered ‘low resource’ but morphologically rich. The shared task involved analysing an already inflected word and generating new forms that were not seen during training. The training data contains incomplete paradigms, and no dictionary of lemmas is provided.

Model architecture for predicting morphological features for an inflected word

We use bi-directional Gated Recurrent Units (GRUs) to encode the inflected word. At each encoding time step, we concatenate the forward and backward hidden states, u = [hₐ; hᵣ], where hₐ and hᵣ denote the hidden states of the forward and backward RNN encoders respectively.
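The encoder can be sketched in plain Python as follows. This is a toy illustration under several assumptions: characters are one-hot encoded, biases are omitted, weights are random rather than trained, and the update rule follows one common GRU convention. The class and function names are hypothetical; a real implementation would use a deep learning library.

```python
import math
import random


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gate(W, U, x, h, squash):
    """Apply squash(W x + U h) element-wise."""
    pre = [a + b for a, b in zip(matvec(W, x), matvec(U, h))]
    return [squash(p) for p in pre]


class GRUCell:
    def __init__(self, input_size, hidden_size, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.hidden_size = hidden_size
        self.Wz, self.Uz = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.Wr, self.Ur = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.Wh, self.Uh = mat(hidden_size, input_size), mat(hidden_size, hidden_size)

    def step(self, x, h):
        z = gate(self.Wz, self.Uz, x, h, sigmoid)    # update gate
        r = gate(self.Wr, self.Ur, x, h, sigmoid)    # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = gate(self.Wh, self.Uh, x, reset_h, math.tanh)  # candidate state
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run(cell, inputs):
    """Run a cell over a sequence and return the hidden state at every step."""
    h = [0.0] * cell.hidden_size
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def encode(word, alphabet, fwd_cell, bwd_cell):
    one_hot = [[1.0 if ch == a else 0.0 for a in alphabet] for ch in word]
    forward = run(fwd_cell, one_hot)
    # Run right-to-left, then flip so position t lines up with the forward state.
    backward = run(bwd_cell, one_hot[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]  # list concatenation per step


rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz"
fwd_cell = GRUCell(len(alphabet), 4, rng)
bwd_cell = GRUCell(len(alphabet), 4, rng)
states = encode("walked", alphabet, fwd_cell, bwd_cell)
print(len(states), len(states[0]))  # 6 8
```

Each of the six characters gets an 8-dimensional vector: four values summarizing the characters to its left and four summarizing those to its right.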

Since a prediction containing a subset of the morphological features is also acceptable, we compute precision and recall for each prediction. The mean precision and recall over the entire test dataset using the Bernoulli distribution are 0.7185 and 0.9588 respectively. We observe that when the features are modelled independently of each other, the model achieves high recall at the cost of lower precision.
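The evaluation can be sketched as follows: precision and recall are computed for each predicted feature vector and then averaged over the dataset. The toy vectors are made up for illustration, and treating an empty prediction or empty target as a score of 0 is an assumption rather than something specified above.

```python
def precision_recall(predicted, target):
    """Precision and recall of one predicted multi-hot vector against its target."""
    true_pos = sum(1 for p, t in zip(predicted, target) if p == 1 and t == 1)
    n_pred = sum(predicted)
    n_true = sum(target)
    precision = true_pos / n_pred if n_pred else 0.0  # assumed convention
    recall = true_pos / n_true if n_true else 0.0     # assumed convention
    return precision, recall


def mean_precision_recall(predictions, targets):
    """Average the per-example precision and recall over a dataset."""
    pairs = [precision_recall(p, t) for p, t in zip(predictions, targets)]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


# Toy data: both predictions contain every target feature plus one extra.
targets = [[1, 0, 1, 0], [0, 1, 0, 0]]
predictions = [[1, 1, 1, 0], [0, 1, 1, 0]]
mean_p, mean_r = mean_precision_recall(predictions, targets)
print(mean_p, mean_r)  # mean precision is 7/12, mean recall is 1.0
```

The toy example mirrors the reported pattern: predicting extra features leaves recall untouched but pulls precision down.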

The figure above shows a few examples from the Turkish dataset. The first row shows the target labels, followed by a sample from the Bernoulli model. We observe that the Bernoulli distribution tends to over-predict some features.


The presence or absence of a morphological feature can affect the semantic meaning conveyed by a word. In this work, we aimed to model the distribution of these features. We implemented a deep neural network that predicts the morphological features of a word by optimizing the parameters of that distribution. Since the morphological features are binary, we assumed a Bernoulli distribution and trained the network to optimize its parameters. While the performance of this network is good, an obvious limitation in a semi-supervised setting is the inability to backpropagate through the discrete random variables.

In reality, many languages have features that depend on each other. For example, in English some features cannot co-occur with others. Our original formulation assumes a mean-field approximate posterior. To induce an auto-regressive property over the parameters, MADE (Masked Autoencoder for Distribution Estimation) [3], which is built from masked dense layers, can be used.
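As a rough sketch of how masking induces the auto-regressive property, the code below builds the two masks of a single-hidden-layer MADE over the feature vector and checks which outputs can see which inputs. The hidden-unit degrees are arbitrary illustrative values, and how the word encoding would condition such a network is left out; this is not the authors' implementation.

```python
def made_masks(n_features, hidden_degrees):
    """Build input->hidden and hidden->output masks for one hidden layer.

    Features are numbered 1..n_features. Hidden unit k has a degree m_k in
    1..n_features-1, may read inputs d with d <= m_k, and may feed outputs d
    with d > m_k.
    """
    in_mask = [[1 if m >= d else 0 for d in range(1, n_features + 1)]
               for m in hidden_degrees]
    out_mask = [[1 if d > m else 0 for m in hidden_degrees]
                for d in range(1, n_features + 1)]
    return in_mask, out_mask


def connectivity(in_mask, out_mask):
    """paths[d][j] = number of unmasked paths from input j to output d."""
    n_hidden = len(in_mask)
    n_in = len(in_mask[0])
    return [[sum(out_mask[d][k] * in_mask[k][j] for k in range(n_hidden))
             for j in range(n_in)]
            for d in range(len(out_mask))]


in_mask, out_mask = made_masks(4, hidden_degrees=[1, 2, 3, 1, 2])
paths = connectivity(in_mask, out_mask)
for row in paths:
    print(row)
```

The printed matrix is strictly lower triangular: the output for feature i is connected only to features that come before it in the ordering, so the parameter of each feature can depend on the values of the earlier ones instead of being predicted independently.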


  1. Kibrik, A. E. (1998). The Handbook of Morphology, pages 455–476.
  2. Cotterell, R., Kirov, C., Sylak-Glassman, J., Yarowsky, D., Eisner, J., and Hulden, M. (2016). The SIGMORPHON 2016 shared task: morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.
  3. Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889.


