Importance of data preprocessing in NLP/NLU

Textual data processing is one of the more interesting problems in machine learning today. Textual data, and the need to process it, comes from many parts of our lives: say you dictate a verbal note to set up a reminder, or you write a message by hand in your notebook that later has to be converted from an image into text. Once the text is available, it needs to be analysed by a machine and processed correctly.

Say one day you get an idea: you are going to classify all the chat messages in your project Slack as relevant or irrelevant to some problem. There might be a general data processing channel that covers many different problems, such as Hadoop-related issues, Spark Scala code questions, or even UI/UX for dashboards.

Such a chat group might have an overwhelming number of entries on a daily basis. Yes, one possible solution is to split that channel into several, but at some point that becomes a nightmare: you have to navigate through all the histories to find the full story on a single topic. A discussion can start as a general architecture discussion, get forwarded to ops and development channels for inter-team discussions, and land in QA/UAT channels.

Before a long weekend you start digging into the problem space, building a Slack bot that collects information and assembles a full history of the messages on a single story. Integration with the Slack API is quite easy, so you solve that problem in just a few hours, and… you have a bunch of records <Ti, Mi>, where Ti is the timestamp of a message and Mi is the text sent by a person. In our case we are less interested in the author and channel, though they might be important as well.
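
To make the shape of those records concrete, here is a minimal sketch of how the <Ti, Mi> pairs could be held in memory. The `Record` type, the field names, and the sample messages are all hypothetical, and it assumes the timestamps arrive as Unix epoch seconds; the Slack API call that would actually fetch the messages is left out on purpose.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Record(NamedTuple):
    """One chat message: <Ti, Mi> from the text above (hypothetical type)."""
    timestamp: datetime  # Ti
    text: str            # Mi


# Hypothetical raw export: (epoch seconds, message text), in arbitrary order.
raw = [
    (1700000120.0, "spark job failed again on the cluster"),
    (1700000000.0, "anyone up for lunch?"),
]

# Tuples compare field by field, so sorting orders the records by timestamp.
records = sorted(
    Record(datetime.fromtimestamp(ts, tz=timezone.utc), msg) for ts, msg in raw
)

for record in records:
    print(record.timestamp.isoformat(), record.text)
```

Keeping the timestamp as the first field means a plain `sorted` call is enough to rebuild the chronological story of a thread.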

We are now going to shift our focus to the textual data pre-processing stage. It can contain a number of interesting steps, but for now let's concentrate on the basics, and one of them is processing the raw data set. Let's choose a very basic approach for the initial step: stemming as the core component of text preparation.

Stemming usually comes together with lemmatisation, and both are powerful, though in general lemmatisation should give you better quality, as it is a more sophisticated approach that digs into the context of the word and tries to infer the word's base form from it. Side note: lemmatisation does not have to rely on context alone, and it can also skip the context of the word completely.

Stemming is widely implemented and well known to most of those who have ever tried to work with textual data. It tries to reduce a word to a common base form by stripping affixes, mostly suffixes. For instance, “вожделение” (Russian for “lust”) would be stemmed to “вожделен”. For those who do not know Russian: the stemming result has no meaning even for Russian speakers, not only for you. But stemming can drastically reduce the size of the corpus vocabulary, reduce dimensionality, and ultimately improve performance and accuracy.
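
To illustrate how stripping suffixes shrinks a vocabulary, here is a deliberately naive sketch. The `toy_stem` function and its suffix list are made up for this example and are far cruder than real stemmers such as Porter or Snowball; the sample sentence is also hypothetical.

```python
# Toy suffix list, checked in this order (an assumption for the example only).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "the job failed while jobs were failing and the cluster fails".split()

raw_vocab = set(tokens)
stemmed_vocab = {toy_stem(token) for token in tokens}

print(len(raw_vocab), sorted(raw_vocab))
print(len(stemmed_vocab), sorted(stemmed_vocab))
```

Here "failed", "failing" and "fails" collapse into "fail", and "jobs" merges with "job", so the vocabulary drops from 10 distinct tokens to 7. Every merged token is one dimension fewer for whatever model comes next.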

And yeah, now you know a bit about the topic. Shall we dig into existing libraries or algorithms for stemming or lemmatisation here? I think not. There are a number of articles on the internet that describe in detail how they work and what the best way to implement them is. Instead, let's have a look at the results I got: the naive way, where raw text is sent to the classifier, gives me 0.679, while stemmed text gives 0.682.

WOW! A gain of only 0.003, but that step also decreased the amount of memory required for training, reduced the time it takes, and slightly improved the score.

PS: For those who might be interested, my pipeline looks like this: (text) -> tfidf -> NaiveBayes.
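
For readers who want to see the shape of such a pipeline, here is a small dependency-free sketch of (text) -> tf-idf -> Naive Bayes. It is not my actual code: the function names, the smoothed idf formula, the multinomial variant of Naive Bayes, and the tiny labelled messages are all assumptions made for the illustration.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Learn smoothed inverse document frequencies from tokenised documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def to_tfidf(doc, idf):
    """Turn one tokenised document into a sparse {term: tf * idf} vector."""
    counts = Counter(t for t in doc if t in idf)  # terms unseen in training are dropped
    return {t: c * idf[t] for t, c in counts.items()}


class NaiveBayes:
    """Multinomial Naive Bayes over tf-idf weights with additive smoothing."""

    def fit(self, vectors, labels, alpha=1.0):
        vocab = {t for vec in vectors for t in vec}
        weights = defaultdict(lambda: defaultdict(float))
        for vec, label in zip(vectors, labels):
            for term, w in vec.items():
                weights[label][term] += w
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(k / len(labels)) for c, k in class_counts.items()}
        self.log_likelihood = {}
        for c in class_counts:
            total = sum(weights[c].values()) + alpha * len(vocab)
            self.log_likelihood[c] = {
                t: math.log((weights[c].get(t, 0.0) + alpha) / total) for t in vocab
            }
        return self

    def predict(self, vec):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_likelihood[c][t] for t, w in vec.items()
            )
        return max(self.log_prior, key=score)


# Hypothetical labelled chat messages.
train = [
    ("spark job failed on the cluster", "relevant"),
    ("hadoop cluster is out of disk", "relevant"),
    ("spark executor memory error", "relevant"),
    ("lunch at noon anyone", "irrelevant"),
    ("happy birthday to our designer", "irrelevant"),
    ("who wants coffee after lunch", "irrelevant"),
]
docs = [text.split() for text, _ in train]
labels = [label for _, label in train]

idf = fit_idf(docs)
model = NaiveBayes().fit([to_tfidf(doc, idf) for doc in docs], labels)


def classify(text):
    """Run one raw message through the whole pipeline."""
    return model.predict(to_tfidf(text.lower().split(), idf))


print(classify("spark cluster error"))  # relevant
print(classify("coffee and lunch"))     # irrelevant
```

A stemming step would slot in right before `fit_idf` and inside `classify`, by mapping every token through the stemmer. In practice you would more likely reach for a library, for example scikit-learn's `TfidfVectorizer` and `MultinomialNB`, than hand-roll either stage.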