Natural Language Processing
“All that we are is the result of what we have thought.
The mind is everything. What we think we become.”
~ The Buddha
NLP is a branch of artificial intelligence concerned with analyzing, understanding, and deriving information from text data in a smart and efficient manner.
NLP is used to analyze text, allowing machines to understand how humans speak. This human-computer interaction enables real-world applications like automatic text summarization, sentiment analysis, topic extraction, named entity recognition, part-of-speech tagging, relationship extraction, stemming, and more. NLP is commonly used for text mining, machine translation, and automated question answering.
Some basic tasks of NLP are:
- Tokenization — the process of converting a text into tokens
- Tokens — the words or entities present in the text
- Text object — a sentence, phrase, word, or article
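As a minimal sketch of the tokenization step, a simple regex can split a text object into word tokens (real toolkits such as NLTK or spaCy handle contractions, punctuation, and edge cases far more carefully):

```python
import re

def tokenize(text):
    # Extract runs of word characters; punctuation is discarded.
    return re.findall(r"\w+", text)

print(tokenize("John likes to watch movies."))
# ['John', 'likes', 'to', 'watch', 'movies']
```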
By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
How do we represent a sentence for processing?
There are multiple ways to build a sentence vector representation:
1. Bag-of-words: each word is a dimension, so a sentence is a vector of dimension |V|, where V is the vocabulary. Each word dimension is given a value equal to the number of times that word occurs in the sentence.
2. TF-IDF based: the same as bag-of-words, but tf–idf weights are used instead of raw word counts.
3. Word embedding: word vectors are combined into a sentence vector, either by a neural network (recursively combining word embeddings with a recursive/recurrent neural net) or by a non-neural algorithm such as doc2vec. Here the sentence vector generally has the same dimensionality as the word embeddings.
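To make option 2 concrete, here is a sketch of the classic tf–idf weighting, tf(t, d) × log(N / df(t)), computed by hand over two tiny tokenized documents (note that libraries such as scikit-learn use a smoothed variant, so their exact values differ):

```python
import math

# Two pre-tokenized toy documents.
docs = [["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"],
        ["John", "also", "likes", "to", "watch", "football", "games"]]

N = len(docs)
# Document frequency: in how many documents does each term appear?
df = {t: sum(1 for d in docs if t in d)
      for t in {t for d in docs for t in d}}

def tfidf(doc):
    # Raw term count times log inverse document frequency.
    return {t: doc.count(t) * math.log(N / df[t]) for t in set(doc)}

weights = tfidf(docs[0])
# Terms appearing in every document (e.g. "likes") get weight 0;
# distinguishing terms (e.g. "movies") get a positive weight.
```

This illustrates why tf–idf often beats raw counts: words shared by all documents carry no discriminative information and are weighted down to zero.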
The following example models a text document using bag-of-words.
Here are two simple text documents:
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
Based on these two text documents, a list is constructed as follows:
["John","likes","to","watch","movies","Mary","too","also","football","games"]
In practice, the bag-of-words model is mainly used as a tool for feature generation. After transforming the text into a “bag of words”, we can calculate various measures to characterize the text. The most common type of feature calculated from the bag-of-words model is term frequency, namely, the number of times a term appears in the text. For the example above, we can construct the following two lists to record the term frequencies of all the distinct words:
(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
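The construction above can be reproduced in a few lines of plain Python — build the vocabulary in order of first appearance, then count each term per document (scikit-learn's CountVectorizer does the same job, but sorts the vocabulary alphabetically):

```python
def tokenize(text):
    # Strip sentence-final periods and split on whitespace.
    return text.replace(".", "").split()

docs = ["John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games."]

# Vocabulary in order of first appearance, as in the list above.
vocab = []
for doc in docs:
    for tok in tokenize(doc):
        if tok not in vocab:
            vocab.append(tok)

# One term-frequency vector per document.
vectors = [[tokenize(doc).count(term) for term in vocab] for doc in docs]

print(vocab)    # ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'too', 'also', 'football', 'games']
print(vectors)  # [[1, 2, 1, 1, 2, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]
```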
Conceptually, we can view the bag-of-words model as a special case of the n-gram model, with n = 1.
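A quick sketch makes the n-gram connection concrete: extracting overlapping windows of n tokens gives n-grams, and setting n = 1 recovers exactly the individual tokens that bag-of-words counts.

```python
def ngrams(tokens, n):
    # All contiguous windows of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "John likes to watch movies".split()

print(ngrams(tokens, 1))  # unigrams: the bag-of-words tokens
print(ngrams(tokens, 2))  # bigrams: ('John', 'likes'), ('likes', 'to'), ...
```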
Connect with ArIES to know more.
See the following links for implementation of various NLP tasks:
https://www.kaggle.com/c/word2vec-nlp-tutorial
Important Libraries for NLP (python)
- Scikit-learn: Machine learning in Python
- Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
- spaCy — Industrial strength NLP with Python and Cython.
- Gensim — Topic Modelling for Humans
- Stanford Core NLP — NLP services and packages by Stanford NLP Group.
