Text Normalization for NLP in .NET

Haiping Chen
Published in SciSharp STACK
Sep 5, 2018

Almost all NLP engineers now use Python or Java because of open source NLP toolkits such as NLTK, CoreNLP, SpaCy and OpenNLP, as well as machine learning libraries like Scikit-Learn. If you are a .NET developer, it is very difficult to do NLP work in C#. Microsoft does have open source libraries like ML.NET, but ML.NET is not a specialized NLP toolkit, and many NLP-specific components are missing from it.

So I decided to start my own NLP toolkit from scratch, implementing common functions and model algorithms entirely in C#. Implementing every step of an NLP pipeline from scratch is not easy, and I plan to write a series of blog posts to record the project. I am not sure I can complete all of it, but I have been researching NLP for two years, and I will keep learning and turning what I learn into code.

Normalizing text means converting it to a more convenient, standard form. Most of what we are going to do in NLP relies on first separating out, or tokenizing, the words in running text. English words are often separated from each other by whitespace, but whitespace is not always sufficient. New York and rock ’n’ roll are sometimes treated as single large words despite the fact that they contain spaces, while sometimes we’ll need to separate I’m into the two words I and am. For processing tweets or texts we’ll need to tokenize emoticons like :) or hashtags like #nlproc. Some languages, like Chinese, don’t have spaces between words, so word tokenization becomes more difficult.
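To make the whitespace problem concrete, here is a minimal sketch in Python (used only for brevity; the same regular expression should port to C#'s Regex class with little change). The pattern and the name simple_tokenize are my own illustrative choices, not part of any standard or library. It splits the clitic off I'm as 'm and keeps emoticons and hashtags whole. It does not map 'm to am and does not join New York into one token, since both need extra rules or a lexicon.

```python
import re

# Illustrative pattern only (an assumption for this sketch, not the Penn
# Treebank rules). Alternatives are tried left to right, so the more
# specific token types are listed first.
TOKEN_PATTERN = re.compile(r"""
    [:;=][-']?[()DPp]        # simple emoticons such as :) ;-) :D
  | \#\w+                    # hashtags such as #nlproc
  | '(?:m|s|re|ve|ll|d)\b    # clitics split from the preceding word
  | \w+                      # runs of letters, digits and underscores
  | [^\w\s]                  # any other single punctuation character
""", re.VERBOSE | re.IGNORECASE)


def simple_tokenize(text):
    """Return the list of tokens found in text, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)


text = "I'm in New York, tweeting :) #nlproc"

# Whitespace splitting leaves "I'm" unsplit and glues the comma onto "York".
whitespace_tokens = text.split()

# The regex version separates the clitic and the comma.
regex_tokens = simple_tokenize(text)

print(whitespace_tokens)
print(regex_tokens)
```

Ordering the alternatives from most to least specific is what keeps :) from being broken into two punctuation tokens.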

One commonly used tokenization standard is known as the Penn Treebank tokenization standard. The original implementation is here, a Python implementation is here, and a Java implementation is here.

No C# implementation? Not true. You can find one here.

The Treebank tokenizer uses regular expressions to tokenize text according to the Penn Treebank standard. I ran it, and a screenshot of the output is below:

It seems to run very well. I really like that each token in the result comes with its start and end position in the original text; I did not find an equivalent function in NLTK.
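For readers who want to see how a regex tokenizer can report positions, here is a rough Python sketch. It is not the code of the C# tokenizer above and covers only a small subset of Treebank-style conventions (splitting n't and a few other clitics, and separating punctuation). The pattern and the name tokenize_with_spans are hypothetical, chosen for this example.

```python
import re

# A small, illustrative subset of Treebank-style splitting rules
# (an assumption for this sketch, not the full standard).
TREEBANK_LIKE = re.compile(r"""
    \w+(?=n't\b)             # word stem before n't: "do" in "don't"
  | n't\b                    # the negative clitic itself
  | '(?:s|m|d|re|ve|ll)\b    # other clitics: 's 'm 'd 're 've 'll
  | \w+                      # ordinary words and numbers
  | [^\w\s]                  # each remaining punctuation mark on its own
""", re.VERBOSE | re.IGNORECASE)


def tokenize_with_spans(text):
    """Return (token, start, end) triples; end is exclusive."""
    return [(m.group(), m.start(), m.end())
            for m in TREEBANK_LIKE.finditer(text)]


sentence = "They don't tokenize, do they?"
spans = tokenize_with_spans(sentence)

for token, start, end in spans:
    # Slicing the original text with the offsets gives back the token.
    print(token, start, end, sentence[start:end] == token)
```

Because the offsets come straight from the match objects, every token can be mapped back to the exact characters it came from, which is handy for highlighting or for aligning annotations with the raw text.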
