MUSoC-Week 5
The final week before the final evaluations has been a tough one, to say the least. Nevertheless, it has been interesting to work through the various challenges and obstacles that came up.
After further testing and analysis of the probability distributions, I was convinced that the results were legitimate. I implemented the frequency and probability distributions myself and obtained similar results, so that's one less thing to worry about.
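For the record, the cross-check boils down to counting tag transitions and normalising. This is a minimal sketch of that idea using a toy tagged corpus and hypothetical variable names, not the project's actual code or the Brown corpus:

```python
from collections import Counter

# Toy tagged corpus standing in for the real training data (hypothetical).
tagged = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]

# Frequency distribution over (previous tag, tag) pairs.
trans_freq = Counter()
for sent in tagged:
    tags = ["<s>"] + [tag for _, tag in sent]  # <s> marks sentence start
    for prev, cur in zip(tags, tags[1:]):
        trans_freq[(prev, cur)] += 1

# Conditional probability distribution P(tag | previous tag),
# obtained by normalising each row of the frequency table.
totals = Counter(prev for prev, _ in trans_freq.elements())
trans_prob = {(p, c): n / totals[p] for (p, c), n in trans_freq.items()}
```

Comparing a hand-rolled table like `trans_prob` against the library's conditional distributions is a quick way to confirm the probabilities are legitimate.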
The Viterbi algorithm was the next challenge. After testing the implementation on the Brown corpus, I can conclude that the algorithm's cubic time complexity in the size of the tag set severely penalises corpora that employ an intricate lexicon of PoS tags. In other words, the 400-odd PoS tags combined with the trigram language model spell disaster in terms of the computational resources required to determine the most likely tag sequence for an input sentence.
Viable solutions to this problem include reducing the number of PoS tags or designing a bigram model. These compromises, however, come with their own drawbacks. Reducing the number of tags would defeat the purpose of using a detailed corpus and could also introduce ambiguity in tagging if not done carefully. Opting for a bigram model, on the other hand, discards a word of context and could result in less accurate predictions.
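To make the bigram option concrete, here is a minimal log-probability bigram Viterbi sketch, which runs in O(n·T²) instead of the trigram's O(n·T³). The model dictionaries (`trans`, `emit`, `start`) are hypothetical names for illustration, not the project's actual API:

```python
import math

def viterbi_bigram(words, tags, trans, emit, start):
    """Bigram Viterbi decoder over log-probabilities.
    trans[(p, t)] = log P(t | p), emit[(t, w)] = log P(w | t),
    start[t] = log P(t at sentence start); missing entries mean -inf."""
    V = [{t: start.get(t, -math.inf) + emit.get((t, words[0]), -math.inf)
          for t in tags}]
    back = [{}]
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            # Best predecessor for tag t: only T candidates per tag (O(T^2)).
            best_p = max(tags,
                         key=lambda p: V[-1][p] + trans.get((p, t), -math.inf))
            scores[t] = (V[-1][best_p] + trans.get((best_p, t), -math.inf)
                         + emit.get((t, w), -math.inf))
            ptrs[t] = best_p
        V.append(scores)
        back.append(ptrs)
    # Backtrace the highest-scoring tag sequence.
    last = max(tags, key=lambda t: V[-1][t])
    seq = [last]
    for ptrs in reversed(back[1:]):
        seq.append(ptrs[seq[-1]])
    return seq[::-1]
```

The same skeleton extends to trigrams by making each state a (previous tag, tag) pair, which is exactly where the extra factor of T comes from.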
Now, all that's left is to analyse the effect of each possible solution on the test data and benchmark the results against NLTK's taggers as a baseline.
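Before reaching for NLTK itself, a most-frequent-tag lookup is the usual sanity-check baseline to beat. A self-contained sketch with toy data standing in for the real train/test split (all names here are hypothetical):

```python
from collections import Counter, defaultdict

# Toy training data in (word, tag) form; the real run would use corpus splits.
train = [[("the", "DET"), ("dog", "NOUN")],
         [("the", "DET"), ("run", "VERB")]]

# For each word, remember its most frequent tag in training.
freq = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        freq[word][tag] += 1
baseline = {w: c.most_common(1)[0][0] for w, c in freq.items()}

def accuracy(tagged_sents):
    # Token-level accuracy of the lookup baseline; unseen words count as misses.
    pairs = [(w, t) for s in tagged_sents for (w, t) in s]
    return sum(baseline.get(w) == t for w, t in pairs) / len(pairs)
```

Any Viterbi variant that cannot clear this baseline on the held-out data is not worth its extra computation, which makes the comparison a cheap first filter.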
