Andres Hohendahl
Sep 1, 2018

I think this problem of contextual NLU and WSD is addressable; we only need a good representation of ideas, as a superclass of words, embedded in sentences. The goal is to find a dense representation of these things, where the “context” is nothing more than operating in a restrictive, pseudo-mathematical way over the whole meaning, like a special lens or an intersection punch-hole.
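As a minimal sketch of that intersection idea (a toy, Lesk-style overlap over sets of related terms, purely my illustration here, not a dense representation), the context can be seen as a filter that keeps whichever candidate sense it overlaps with most:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class IntersectionWsdSketch
{
    static void Main()
    {
        // Toy "senses" of the ambiguous word "bank": each sense is just a
        // set of terms related to that meaning (hypothetical data).
        var senses = new Dictionary<string, HashSet<string>>
        {
            ["bank/finance"] = new HashSet<string> { "money", "loan", "account", "deposit" },
            ["bank/river"]   = new HashSet<string> { "river", "shore", "water", "fishing" }
        };

        // The sentence context acts like a punch hole: intersect it with
        // each sense and keep the sense with the largest overlap.
        var context = new HashSet<string> { "she", "opened", "an", "account", "to", "deposit", "money" };

        var best = senses
            .OrderByDescending(s => s.Value.Intersect(context).Count())
            .First();

        Console.WriteLine($"Selected sense: {best.Key}"); // bank/finance
    }
}
```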

On my journey into NLP, we wanted to build robust lemmatization, and this apparently simple problem is horribly NP-hard, especially on highly inflected and agglutinative languages like Spanish, German, French, Portuguese, Italian, etc. (a single Spanish verb already has dozens of conjugated forms, and enclitic pronouns, as in “dámelo”, multiply them further).

So I did not let this get me down, and drilled down, doodling my way towards an “elegant” solution, trying to linearize the exponential nature of defective agglutination and inflection. After struggling long enough with the problem, including ill-behaved spelling errors, I finally came up with a simple and elegant idea.

The final lemmatizer is a pseudo-statistical + rule-based module, capable of detecting, spell-correcting and inferring almost any parasynthetic word in any specific language (built initially for Spanish). For Spanish, the algorithm handles 5k inflection rules and 2k prepositional rules in a breeze, reaching >200k words/second on clean text with a single Xeon server core, and even repairing >1k words/sec on ill-behaved, spelling-dirty text. It has been built in C# under the .NET 4.5 framework, and I am really proud of it.
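I can’t reproduce the real module here, but just to give a flavor of the rule-based side, a heavily simplified sketch could look like the following. The rules, the tiny lemma list and the one-edit repair are illustrative placeholders of my own, not the actual 5k/2k rule sets or the real algorithm:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative lemmatizer sketch: strip known suffixes, validate candidates
// against a lemma dictionary, and fall back to a one-edit-distance repair
// for misspelled words. All data below is a toy placeholder.
class ToyLemmatizer
{
    // (suffix, replacement) inflection rules, e.g. "cantando" -> "cantar".
    private static readonly (string Suffix, string Replacement)[] Rules =
    {
        ("ando", "ar"), ("iendo", "er"), ("aron", "ar"), ("es", ""), ("s", "")
    };

    private static readonly HashSet<string> Lemmas = new HashSet<string>
    {
        "cantar", "comer", "flor", "casa"
    };

    public static string Lemmatize(string word)
    {
        word = word.ToLowerInvariant();
        if (Lemmas.Contains(word)) return word;

        // Try every rule and keep the first candidate that is a known lemma.
        foreach (var (suffix, replacement) in Rules)
        {
            if (word.EndsWith(suffix))
            {
                var candidate = word.Substring(0, word.Length - suffix.Length) + replacement;
                if (Lemmas.Contains(candidate)) return candidate;
            }
        }

        // Spell-repair fallback: accept a lemma within edit distance 1.
        var close = Lemmas.FirstOrDefault(l => EditDistance(word, l) <= 1);
        return close ?? word;
    }

    // Standard Levenshtein distance over two strings.
    private static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        Console.WriteLine(Lemmatize("cantando")); // cantar (rule)
        Console.WriteLine(Lemmatize("flores"));   // flor   (rule)
        Console.WriteLine(Lemmatize("caza"));     // casa   (one-edit repair)
    }
}
```

The real module chains many more rule layers and statistics, of course; the point of the sketch is only the shape of the pipeline: rules first, dictionary validation, then spell repair as a last resort.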