This post is a beginner's guide to Natural Language Processing. Language is how we humans communicate with each other. Every day we produce text data in emails, SMS, tweets, and so on, and we need methods to understand this type of data, just as we do for other types of data. We will learn some basic but important techniques in Natural Language Processing.
What is Natural Language Processing (NLP)?
As per Wikipedia:
Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data.
In simple terms, natural language processing (NLP) is the ability of computers to understand human language as it is written or spoken. NLP helps to analyze, understand, and derive meaning from human language in a smart and useful way.
Most modern NLP algorithms are based on machine learning. Instead of hand-coding large sets of rules, they learn by analyzing a set of examples (i.e. a corpus, which can range from a whole book down to a collection of sentences) and making statistical inferences. With NLP we can organize massive chunks of text data and solve a wide range of problems, such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
Let’s dive deeper…
As we all know, text is one of the most unstructured forms of data available. It is important to clean and standardize this text and make it noise-free. The idea is to take the raw text and turn it into something that an ML algorithm can use to make predictions. We will talk about a few important techniques using NLTK (the Natural Language Toolkit, a Python library).
First, we break the article into sentences. We often have to do analysis at the sentence level. For example, we may want to check the number of sentences in an article and the number of words in each sentence.
Tokenization breaks unstructured text into chunks of information that can be counted as discrete elements. This immediately turns an unstructured string (a text document) into more usable data, which can be further structured and made more suitable for machine learning. Here we take the first sentence and get each word as a token. NLTK offers two common ways to do this: RegexpTokenizer and word_tokenize.
Consider words like a, an, the, be, etc. These words add little extra information to a sentence and can often create noise while modelling. Such words are known as stop words. We filter each sentence by removing the stop words.
Stemming And Lemmatization
Some words carry the same core meaning, for example copy, copied, and copying. A model might treat them as different words, so we tend to strip such words down to their core. We can do that by stemming or lemmatization, which are basic text processing methods for English text.
Stemming helps to create groups of words that have similar meanings. It works from a set of rules, such as removing “ing” from words that end with “ing”. NLTK provides several stemmers, including PorterStemmer, LancasterStemmer, and SnowballStemmer.
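The three stemmers can be compared side by side on the copy, copied, copying example. This is only a sketch: the dictionary layout and variable names are my own, and the exact stems can differ from one stemmer to another.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

words = ["copy", "copied", "copying"]

# The stemmers are purely rule based, so no extra data download is needed.
stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
    "Snowball": SnowballStemmer("english"),
}

# Stem every word with every stemmer and collect the results by name.
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}
for name, result in stems.items():
    print(name, result)
```

With the Porter stemmer, all three words collapse to one shared stem. Note that a stem does not have to be a real dictionary word; it only has to be the same for related words.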
Lemmatization uses a knowledge base called WordNet. Because of this knowledge, lemmatization can even convert words that rule-based stemmers can’t handle, for example converting “came” to “come”.
These are a few basic techniques used in NLP. I hope I’ve cleared up some of the basic and important concepts, as they are the building blocks of many other NLP concepts. To learn more about NLTK, visit this link.
Thanks for reading! ❤