This week marks the end of the coding phase. It has been a rather stressful one, with a lot of time spent trying to perfect the PoS tagger.
The results after baselining weren’t exemplary, but they were satisfactory. In most cases, the tagger failed to assign PoS tags to a few tokens. One possible reason is that the probabilities become vanishingly small as the length of the test sentence increases. One comforting observation, however, is that even NLTK’s trigram tagger failed to assign tags to a few tokens, especially when the input sentence did not belong to the training data.
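If shrinking probabilities really are the cause, the usual remedy is to add log-probabilities instead of multiplying raw probabilities. The snippet below is only a minimal sketch of that idea, not the code from my tagger: the function names and the per-token probability of 1e-4 are made up for illustration.

```python
import math

def product_prob(probs):
    """Multiply probabilities directly; can underflow to 0.0 on long inputs."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_prob(probs):
    """Sum log-probabilities instead; stays finite as long as every p > 0."""
    return sum(math.log(p) for p in probs)

# Hypothetical per-token probabilities, chosen only to show the effect.
short_sentence = [1e-4] * 10
long_sentence = [1e-4] * 100

print(product_prob(short_sentence))  # tiny, but still representable
print(product_prob(long_sentence))   # too small for a float, collapses to 0.0
print(log_prob(long_sentence))       # a large negative, but finite, score
```

Once every candidate path scores 0.0, the tagger has no way to pick between them, which would look exactly like tokens being left untagged. Log scores keep the ordering of the paths intact, so the best one can still be found.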
Another problem with the PoS tagger is that it is incredibly slow. During the development phase, I was struggling with MemoryErrors, which are often a tell-tale sign of poorly optimised code. Despite my efforts to optimise the code, and eventually fixing the MemoryError, the Viterbi algorithm was still slow owing to its polynomial time complexity (O(mn³), to be precise).
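To show where the cost comes from, here is a minimal sketch of Viterbi decoding. It is deliberately the simpler bigram version, not my tagger: each tag depends on one previous tag only, so the work is O(mn²), assuming m is the sentence length and n the number of tags. Conditioning on the two previous tags, as a trigram model does, adds one more loop over the tags, which is where an n³ term would come from. All names and probability tables below are hypothetical toy values.

```python
import math

def viterbi(tokens, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for tokens under a bigram HMM.

    The tables are plain dicts of probabilities (hypothetical toy data);
    scores are kept in log space to avoid underflow.
    """
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = best log-score of any tag path ending in tag t at position i
    best = [{t: lg(start_p.get(t, 0)) + lg(emit_p[t].get(tokens[0], 0))
             for t in tags}]
    back = [{}]

    for i in range(1, len(tokens)):      # m positions
        best.append({})
        back.append({})
        for t in tags:                   # n candidate tags for this position
            # Scan all n previous tags: n * n work per position.
            prev, score = max(
                ((p, best[i - 1][p] + lg(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lg(emit_p[t].get(tokens[i], 0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model with invented probabilities, for illustration only.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.1, "VERB": 0.8},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.6},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
```

With a large tagset, the inner scan over previous tags dominates the running time, which is one reason reducing the number of PoS tags can make a noticeable difference.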
During the last week, I spent far too much time trying to comprehend the frequency and probability distributions generated from the corpus. Finally, after all the time spent testing different language models, reducing the number of PoS tags and testing on small chunks of data from the corpus, I ended up using my own distributions instead of NLTK’s. Had I dedicated more time to the tagging algorithm, I might have achieved better results.
Ultimately, I can’t deny that I’m happy with the results. At the end of the coding phase, I believe I have fulfilled all the promises made in the proposal within the given time period. Now it’s time to wait for the results of the final evaluations to be announced.
After six long weeks of coding, MUSoC has lived up to my expectations. I have, without a doubt, learnt a lot this summer. My mentor, Yash Kumar Lal, has been really helpful throughout, clarifying all my doubts and offering suggestions in difficult situations. In conclusion, this experience has fuelled my passion for data science and has motivated me to explore this field of computer science.