Universal Dependencies: a Hidden (Markov) Quest. Drem yol lok!

Pranav Gupta · Published in Analytics Vidhya · Jun 18, 2020 · 7 min read

To Skyrim junkies: the title was bait to make you read the article. But please give me a clap before you say Gjok Hi and go! :(

Even a toddler will beatifically remark that NLP (natural language processing) is having a field day today. Mind-blowing progress is being made for languages that have a lot of data. But at the same time, there are many low-resource languages that don’t get enough attention despite having significant numbers of speakers.

Transfer learning and other cool ideas combining SMT (statistical machine translation) and NMT (neural machine translation) have tried to bridge the data gap. Check out XLM-R, for instance. Still, no toddler can deny the fact that we need more data for all languages, just as we did when our brains learned language(s). UD (Universal Dependencies) is a step in this direction: the goal is consistent grammar annotation across all languages. What is really cool about UD is that it covers all sorts of languages: chances are you will find a language in the list that you have never heard of. In short, UD is a nice starting point for anyone interested in cross-lingual NLP. (Personally, I couldn’t resist the mouth-watering temptation of playing with Belarusian data.)

Universal Dependencies (from universaldependencies.org)

The goal of this post is two-fold:

  1. get familiar with the UD data (version 2.6 is used here) and its diversity
  2. give a quick intro to HMMs and their efficiency in POS tagging tasks

Here is what the UD data looks like. You can see that certain languages have multiple datasets, whereas others have incomplete or minimal data; I have not considered such languages in this article.

UD subdirectories

In this article I will focus on POS tagging via HMMs (Hidden Markov Models). For simplicity, all words are lemmatized before further processing. The *train.conllu, *dev.conllu, and *test.conllu files are parsed using the pyconll package. Each .conllu file lists the lemmatized form next to each token, so you don’t have to bang your head over how to lemmatize “ترفض”. Moreover, each token comes with its POS tag close by. All these attributes are easily read into pyconll objects. In cases where dev data is available, it is merged into the training data, since we are not iteratively training our HMMs.
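For example, reading one treebank split into (lemma, POS tag) pairs takes only a few lines with pyconll; the Belarusian path below is just an illustration:

```python
import pyconll

# Load one UD split; the path is illustrative, pick any treebank you like.
train = pyconll.load_from_file("UD_Belarusian-HSE/be_hse-ud-train.conllu")

# Keep (lemma, universal POS tag) pairs per sentence;
# this is all the HMM tagger needs downstream.
sentences = [[(token.lemma, token.upos) for token in sentence]
             for sentence in train]
```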

Why study Hidden Markov models for POS tagging?

Hidden Markov Models are ubiquitous in speech processing, gesture recognition, and similar domains. The common thread among all these applications is that HMMs try to capture patterns that unfold as a function of time.

For instance, the syllable you pronounce next depends on what you just pronounced. Given the raw audio of each syllable, you want to come up with the best-fitting phoneme for that audio. And we all know that future and past words tend to modify how we pronounce the current phoneme. Hidden Markov models are cut out for such tasks.

However, a relatively less “jazzy” application of Hidden Markov models is part-of-speech tagging. Arguably, the first step in rigorously understanding the structure of a sentence, in any language in the world, is to know what each word/token in the sentence does.

Let’s take an example. If “Mary went to the park”, then a computer would want to understand the role of “Mary”, the role of “went”, and so on. As kids, we all learned that “Mary” is a noun, “went” is a verb, and so on, down to “park”, which is a noun. Silly middle-school grammar, you grumble.

Well, what if I now say “Let Mary park her car”? Is “park” still a noun here? No, it is a verb. Maybe your schoolteacher worked hard to ingrain this into your mind, but what about a computer? How do we teach a computer to learn context? In the second sentence, the phrase after “park” is “her car”, so “park” is likely a verb; in the first sentence, the verb “went” already appeared before “park”, so “park” should be a noun indicating where Mary went.

This is where hidden Markov models come in. To decide the function of a given word, you look at what function the previous word had and what function the next word has. At the same time, there are no direct dependencies between the current word and a word that occurred 10 spots ago. This is the crucial feature of all Markov models: distant history doesn’t matter much. Specifically, for the part-of-speech (POS) tagging task we consider here, we assume that a POS tag directly depends only on the current word (as with “park”) and the previous POS tag. This doesn’t exclude indirect dependencies: the POS tag two words ago can still affect the current tag, but only through the tag in between.
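Concretely, the bigram HMM tagger picks the tag sequence that maximizes a product of exactly these local factors (written here in roughly the notation of Jurafsky and Martin), where the w_i are the (lemmatized) words and the t_i the POS tags:

```latex
\hat{t}_{1:n} = \operatorname*{arg\,max}_{t_{1:n}}
    \; P(t_1 \mid \mathrm{start})
    \prod_{i=2}^{n} P(t_i \mid t_{i-1})
    \prod_{i=1}^{n} P(w_i \mid t_i)
    \; P(\mathrm{end} \mid t_n)
```

The start and end factors are the two extra kinds of arrows discussed below.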

HMM Schematic

For a mathematical description of HMMs, I direct you to Chapter 8 of Jurafsky and Martin’s Speech and Language Processing, freely available online.

Another HMM schematic, this time with conditional probabilities

The arrows in this figure are really important: they are a graphical representation of how inference proceeds in an HMM. Training such a POS tagger amounts to counting the conditional probabilities corresponding to each arrow in this picture. An arrow from “NNP” to “Janet” is an emission probability: the probability of the word being “Janet” given that the POS tag at that position is “NNP”. An arrow from “NNP” to “MD” is a transition probability: the probability of the current tag being “MD” given that the previous tag was “NNP”.

There are actually two more kinds of arrows: you need to know where the sentence starts and ends, that is, the probability that the first POS tag is “NNP” given that it is the starting word, and the probability of having ended the sentence given that the current tag is a punctuation mark, for example, a period “.”. Note that I am considering punctuation to be part of the sentence, in the spirit of the UD POS tagset.
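In code, turning every arrow into a normalized count looks something like this (a minimal sketch over the [(lemma, tag), ...] sentences read earlier; the function name is mine, not from the repo):

```python
from collections import Counter, defaultdict

def estimate_probabilities(tagged_sentences):
    """Estimate emission, transition, start, and end tables by counting."""
    emission = defaultdict(Counter)    # counts for P(word | tag)
    transition = defaultdict(Counter)  # counts for P(tag_i | tag_{i-1})
    start_counts = Counter()           # counts for P(first tag | start)
    end_counts = Counter()             # how often each tag ends a sentence
    tag_counts = Counter()             # total occurrences of each tag

    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        start_counts[tags[0]] += 1
        end_counts[tags[-1]] += 1
        for word, tag in sentence:
            emission[tag][word] += 1
            tag_counts[tag] += 1
        for prev, cur in zip(tags, tags[1:]):
            transition[prev][cur] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (
        {tag: normalize(words) for tag, words in emission.items()},
        {tag: normalize(nexts) for tag, nexts in transition.items()},
        normalize(start_counts),
        # P(end | tag): fraction of a tag's occurrences that end a sentence
        {tag: end_counts[tag] / tag_counts[tag] for tag in end_counts},
    )
```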

Training and testing an HMM can be done smoothly with the Pomegranate library, which is what I use in this article.
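To give a flavor of the API, here is a minimal sketch against pomegranate’s pre-1.0 interface (HiddenMarkovModel, State, DiscreteDistribution). The two-tag model and all probabilities below are made up for the “Mary went to the park” example, not estimated from UD data:

```python
from pomegranate import HiddenMarkovModel, DiscreteDistribution, State

# Toy emission distributions over a tiny shared vocabulary.
noun = State(DiscreteDistribution({"mary": 0.45, "park": 0.45, "went": 0.10}),
             name="NOUN")
verb = State(DiscreteDistribution({"went": 0.50, "park": 0.40, "mary": 0.10}),
             name="VERB")

model = HiddenMarkovModel(name="toy-pos-hmm")
model.add_states(noun, verb)

# Start, transition, and end arrows, exactly as in the schematic.
model.add_transition(model.start, noun, 0.8)
model.add_transition(model.start, verb, 0.2)
model.add_transition(noun, noun, 0.3)
model.add_transition(noun, verb, 0.5)
model.add_transition(noun, model.end, 0.2)
model.add_transition(verb, noun, 0.7)
model.add_transition(verb, model.end, 0.3)
model.bake()

# Viterbi decoding; the returned path includes the start and end states.
logp, path = model.viterbi(["mary", "went", "park"])
print([state.name for _, state in path[1:-1]])  # ['NOUN', 'VERB', 'NOUN']
```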

Training HMM models and predicting tagging accuracy:

Back to the UD data. The goal here is breadth rather than depth: I wanted to cover as many languages as possible and test the tagging accuracy, so I did not implement additional features beyond the plain bigram HMM described above.

Unknown tokens at test time were replaced by the string ‘nan’, which Pomegranate treats as a missing value and ignores while evaluating POS tags.

Tagging accuracy is calculated by simply dividing the number of correctly predicted tags by the total number of tags.
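As a sketch of this evaluation loop (assuming the model built above, sentences as [(lemma, tag), ...] lists, and the set of training lemmas as `vocabulary`; the function name is mine, not from the repo):

```python
def tagging_accuracy(model, tagged_sentences, vocabulary):
    """Fraction of POS tags predicted correctly over all sentences."""
    correct = total = 0
    for sentence in tagged_sentences:
        # Out-of-vocabulary lemmas become the string 'nan', which
        # pomegranate treats as a missing observation.
        words = [w if w in vocabulary else "nan" for w, _ in sentence]
        gold = [t for _, t in sentence]
        _, path = model.viterbi(words)
        predicted = [state.name for _, state in path[1:-1]]
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total
```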

Another important aspect of the UD data is that it isn’t entirely clean. For instance, certain languages, like Bhojpuri, were missing training data, and certain datasets, like UD_Hindi_English-HIENCS/, had blank entries in their .conllu files.

Results:

Visualizing this data is a mundane yet important task. I plot here bar charts of the train and test accuracies for the various languages in the `plot_datasets` list. As you can see, there are many European languages among them. To make the bar plots less cluttered, I divide the languages into two sets, European and non-European; a plotting sketch follows below.
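Something along these lines produces the grouped bars (a matplotlib sketch; `results` and its language -> (train, test) layout are hypothetical, not the repo’s actual data structure):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_accuracies(results, title):
    """Grouped bar chart of train vs. test tagging accuracy per language."""
    languages = sorted(results)
    train = [results[lang][0] for lang in languages]
    test = [results[lang][1] for lang in languages]
    x = np.arange(len(languages))

    plt.figure(figsize=(12, 4))
    plt.bar(x - 0.2, train, width=0.4, label="train")
    plt.bar(x + 0.2, test, width=0.4, label="test")
    plt.xticks(x, languages, rotation=90)
    plt.ylabel("tagging accuracy")
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.show()
```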

European Languages Train+Test Accuracy
Non-European Languages Train+Test Accuracy

Conclusion:

The most important takeaway is that HMMs tend to nail the POS tagging task very well, even when we don’t have humongous amounts of data: a few thousand training sentences already yield >85% accuracy. That said, this could change when you change your data source, curate your data differently, and so on. Moreover, we see good performance (>85%) for a wide variety of languages, which shows the generality of HMMs in capturing POS dependencies.

The code for this project is available at https://github.com/prannerta100/ud-pos-tagger/. Fus Ro Dah!

Feel free to leave comments and suggestions!

Force, Balance, Push!
