Natural language processing (NLP) is the application of computational methods to extract information from text and to build applications on top of it. All natural-language text has systematic structure and rules, often referred to as morphology; for example, the past tense of “jump” is always “jumped”. For humans this morphological understanding is obvious.
In this introductory NLP blog, we will look at different methods for pinning down the morphological structure and rules of language.
The task of segmenting text into relevant words is called tokenization.
In its simplest form, tokenization can be achieved by splitting text on whitespace. NLTK provides a function called word_tokenize() for splitting strings into tokens.
text = 'we will look into the core components that are relevant to language in computational linguistics'
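A minimal sketch of whitespace tokenization with plain Python. NLTK's word_tokenize() (shown in a comment) handles punctuation more carefully, but it requires the punkt tokenizer data to be downloaded first, so the runnable part here sticks to str.split():

```python
text = ('we will look into the core components that are relevant '
        'to language in computational linguistics')

# Split on any run of whitespace.
tokens = text.split()
print(tokens[:4])  # ['we', 'will', 'look', 'into']

# With NLTK (after nltk.download('punkt')):
# from nltk.tokenize import word_tokenize
# tokens = word_tokenize(text)
```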
But simple tokenization doesn’t always work, for example with words that contain punctuation marks (example: “what’s”).
If we want to keep such words as single tokens, a simple hack is to split the text on whitespace and then replace all punctuation with nothing.
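The split-then-strip-punctuation hack can be sketched with Python's str.translate() and the string.punctuation constant:

```python
import string

text = "what's done is done, isn't it?"

# Build a translation table that maps every punctuation character to None.
table = str.maketrans('', '', string.punctuation)

# Split on whitespace, then strip punctuation from each token.
tokens = [word.translate(table) for word in text.split()]
print(tokens)  # ['whats', 'done', 'is', 'done', 'isnt', 'it']
```

Note that “what’s” becomes “whats”, which keeps the word in one piece at the cost of losing the apostrophe.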
Stemming and Lemmatisation
Stemming is the task of reducing each word to its root. For example, “walk” is the root of words like “walks”, “walking”, and “walked”. The root usually carries more meaning than the inflected form, so in many NLP tasks it is important to extract the root of each word in the text.
Stemming reduces the vocabulary present in the documents, which saves a lot of computation. In tasks like classification, the tense of a word becomes irrelevant once stemming is applied.
The most popular method is the Porter stemming algorithm. It is a suffix-stripping algorithm that does not rely on a lookup table of inflected forms and their roots; instead, a small set of rules is applied to strip suffixes and extract the root word.
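NLTK ships an implementation of the Porter algorithm, so suffix stripping can be sketched in a few lines (assuming NLTK is installed; the stemmer itself is rule-based and needs no extra data downloads):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The rules strip suffixes like -s, -ing, -ed down to a common root.
for word in ['walks', 'walking', 'walked']:
    print(word, '->', stemmer.stem(word))
# walks -> walk
# walking -> walk
# walked -> walk
```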
Lemmatisation is very similar to stemming in that it removes inflections and suffixes to reduce words to their roots. However, stemming can lose meaning and context (its output may not even be a valid word), whereas lemmatisation returns a proper dictionary form and preserves the context.
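To see the difference, compare the Porter stemmer's output with what a lemmatiser should return. Here a tiny hand-rolled lookup dict stands in for a real lemmatiser (such as NLTK's WordNetLemmatizer, which requires the WordNet data to be downloaded):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini lemma table, standing in for a real lemmatiser.
lemmas = {'studies': 'study', 'feet': 'foot', 'better': 'good'}

for word in ['studies', 'feet', 'better']:
    print(word, '| stem:', stemmer.stem(word), '| lemma:', lemmas[word])
# studies | stem: studi  | lemma: study
# feet    | stem: feet   | lemma: foot
# better  | stem: better | lemma: good
```

The stemmer produces “studi”, which is not a word, and cannot relate “feet” to “foot” at all; the lemmatiser preserves valid, context-aware forms.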
Normalizing Case: It is common to convert all words to one case.
Stop Words: Stop words are words that do not contribute to extracting information from or modelling the text data, because they are the most common words, such as “the”, “a”, and “is”.
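Both steps can be sketched in a few lines. The stop-word list below is a tiny hand-rolled set for illustration; NLTK provides a fuller list via nltk.corpus.stopwords, which requires a data download:

```python
# Tiny illustrative stop-word list; a real one (e.g. NLTK's) is much larger.
stop_words = {'the', 'a', 'an', 'is', 'are', 'and', 'of', 'to'}

text = 'The root is the core of a word'

# Normalize case first so 'The' and 'the' match the same stop word.
tokens = [w.lower() for w in text.split()]
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['root', 'core', 'word']
```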
Data Cleaning: Before applying complex computational methods to text data, we are expected to understand and clean it. The techniques above help make the text ready for modelling with advanced DNN and NLP techniques.