NATURAL LANGUAGE PROCESSING (NLP)
(PART 1-KEY TERMS)
Hello everyone :) I’m here with my first post on Medium. I would love for you to accompany me on this adventure. If you’re ready, ladies and gentlemen, sit down and fasten your seatbelts. Now we dive deep into Natural Language Processing and start exploring.
Well, what exactly is Natural Language Processing (NLP), and why is it so popular? NLP is a field of study combining artificial intelligence, computer science and linguistics. Our goal is for machines to understand and use the language people speak. The main areas where NLP is used are summarization, text classification and categorization, sentiment analysis, speech recognition, machine translation, question answering, part-of-speech tagging, named entity recognition, spell checking and so on.
❤ NLP BASIC CONCEPTS
𝄞 TOKENIZATION
Tokenization is the process of splitting a sentence, a paragraph or a text document into smaller units such as individual words or terms.
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Cognitive psychology is the study of thinking, concept information and problem solving. While it is a relatively young branch of psychology, it has quickly grown to become one of the most popular subfields."

word_tokenize(text)
['Cognitive', 'psychology', 'is', 'the', 'study', 'of', 'thinking', ',', 'concept', 'information', 'and', 'problem', 'solving', '.', 'While', 'it', 'is', 'a', 'relatively', 'young', 'branch', 'of', 'psychology', ',', 'it', 'has', 'quickly', 'grown', 'to', 'become', 'one', 'of', 'the', 'most', 'popular', 'subfields', '.']

sent_tokenize(text)
['Cognitive psychology is the study of thinking, concept information and problem solving.',
 'While it is a relatively young branch of psychology, it has quickly grown to become one of the most popular subfields.']
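word_tokenize does a lot of careful work (contractions, abbreviations and so on), but the core idea can be sketched with a single regular expression that pulls out runs of word characters and individual punctuation marks. Here, simple_tokenize is a hypothetical toy, not NLTK’s actual algorithm:

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs single
    # punctuation marks, so 'psychology.' becomes ['psychology', '.'].
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("While it is a relatively young branch of psychology."))
# → ['While', 'it', 'is', 'a', 'relatively', 'young', 'branch', 'of', 'psychology', '.']
```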
❀ STOP WORDS
These are actually the most common words in any language, and they do not add much information to the text. Some examples of stop words in English are “the”, “a”, “an”, “so” and “what”.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))  # the actual English stop word list

words = word_tokenize(text)
filtered_words = []
for word in words:
    if word not in stop_words:
        filtered_words.append(word)

filtered_words
['Cognitive', 'psychology', 'study', 'thinking', ',', 'concept', 'information', 'problem', 'solving', '.', 'While', 'relatively', 'young', 'branch', 'psychology', ',', 'quickly', 'grown', 'become', 'one', 'popular', 'subfields', '.']
♕ STEMMING
Stemming is basically removing the suffix from a word to reduce it to its root. For instance:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['Consult', 'Consultant', 'Consulting', 'Consultantative', 'Consultants', 'Consulting', 'cats', 'children', 'go', 'went']
for w in words:
    print(ps.stem(w))

consult
consult
consult
consult
consult
consult
cat
children
go
went
The words ‘children’ and ‘went’ did not change at all, because stemming only processes the suffixes at the end of a word.
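To see why ‘went’ survives untouched, it helps to remember that a stemmer is essentially a cascade of suffix-stripping rules. Below is a deliberately naive sketch (naive_stem is a toy, far simpler than the Porter algorithm NLTK actually uses): an irregular form like ‘went’ matches no suffix rule, so it falls through unchanged.

```python
def naive_stem(word, suffixes=('ations', 'ants', 'ant', 'ing', 'ed', 's')):
    # Try each suffix, longest first, and strip the first one that matches,
    # keeping at least a few characters of stem.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word  # irregular forms like 'went' pass through unchanged

for w in ['consulting', 'consultants', 'cats', 'went']:
    print(naive_stem(w))
# → consult, consult, cat, went
```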
☀ LEMMATIZATION
Lemmatization usually refers to doing things properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
words = ['study', 'studying', 'studies', 'begin', 'began', 'begun', 'leaf', 'leaves', 'child', 'children']
for w in words:
    print(lem.lemmatize(w))

study
studying
study
begin
began
begun
leaf
leaf
child
child
Have you noticed the similarity between stemming and lemmatization? Stemming only cuts off the suffixes at the end of the word, while lemmatization can really get to the dictionary root of the word. If your aim is to get the root, I recommend using lemmatization.
lem.lemmatize('studying')
'studying'

lem.lemmatize('studying', 'v')
'study'

lem.lemmatize('begun')
'begun'

lem.lemmatize('begun', 'v')
'begin'
So, why couldn’t we find the roots of ‘studying’ and ‘begun’ at first? Because lemmatize treats every word as a noun by default; once we pass 'v' as the part of speech, it knows to look for the verb lemma.
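In practice you rarely know the part of speech in advance. A common trick (a sketch, not part of the original example) is to run pos_tag first and map its Penn Treebank tags onto the single-letter codes WordNetLemmatizer expects; penn_to_wordnet below is a hypothetical helper name:

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank POS tag (as returned by nltk.pos_tag) to the
    # part-of-speech code that WordNetLemmatizer.lemmatize() expects.
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default: noun

# Usage with NLTK (assumes the tagger and wordnet data are downloaded):
# for word, tag in nltk.pos_tag(nltk.word_tokenize('She has begun studying')):
#     print(lem.lemmatize(word, penn_to_wordnet(tag)))
```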
ღ PART OF SPEECH TAGGING
It is quite a simple technique: each word in a sentence is labeled with its grammatical role, such as noun, verb or adjective.
import nltk
nltk.download('averaged_perceptron_tagger')

text = 'Your manager stops you and says she needs to have a word about your performance in the recent project. You worry about it all weekend, wondering what you might have done wrong.'
tokenized = nltk.word_tokenize(text)
nltk.pos_tag(tokenized)

[('Your', 'PRP$'), ('manager', 'NN'), ('stops', 'VBZ'), ('you', 'PRP'), ('and', 'CC'), ('says', 'VBZ'), ('she', 'PRP'), ('needs', 'VBZ'), ('to', 'TO'), ('have', 'VB'), ('a', 'DT'), ('word', 'NN'), ('about', 'IN'), ('your', 'PRP$'), ('performance', 'NN'), ('in', 'IN'), ('the', 'DT'), ('recent', 'JJ'), ('project', 'NN'), ('.', '.'), ('You', 'PRP'), ('worry', 'VBP'), ('about', 'IN'), ('it', 'PRP'), ('all', 'DT'), ('weekend', 'NN'), (',', ','), ('wondering', 'VBG'), ('what', 'WP'), ('you', 'PRP'), ('might', 'MD'), ('have', 'VB'), ('done', 'VBN'), ('wrong', 'JJ'), ('.', '.')]
✩ NAMED ENTITY RECOGNITION
Named entity recognition is an NLP technique that automatically identifies named entities in a text and classifies them into predefined categories. Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more. With named entity recognition, you can extract exactly the information you want from a text. Let’s do an example!
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

text = "Let’s put this topic on the back burner until next month because we have more urgent matters."
tokenized = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
named_ent = nltk.ne_chunk(tagged)
named_ent
You may get an error here, like the one below:

The Ghostscript executable isn't found.
See http://web.mit.edu/ghostscript/www/Install.htm
If you're using a Mac, you can try installing
https://docs.brew.sh/Installation then `brew install ghostscript`

When you display “named_ent”, NLTK tries to draw the resulting tree with Ghostscript; since we don’t have Ghostscript installed, it raises an error. But at the end of the error, it still gives us the output like this.
Tree('S', [('Let', 'VB'), ('’', 'PRP'), ('s', 'VB'), ('put', 'VB'), ('this', 'DT'), ('topic', 'NN'), ('on', 'IN'), ('the', 'DT'), ('back', 'JJ'), ('burner', 'NN'), ('until', 'IN'), ('next', 'JJ'), ('month', 'NN'), ('because', 'IN'), ('we', 'PRP'), ('have', 'VBP'), ('more', 'RBR'), ('urgent', 'JJ'), ('matters', 'NNS'), ('.', '.')])
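Our example sentence happens to contain no named entities, so the tree is flat. On text that does contain entities, ne_chunk wraps the recognized ones in labeled subtrees (PERSON, GPE, ORGANIZATION and so on), and you can walk the tree to collect them. A small sketch, where extract_entities is a hypothetical helper and the tree below is built by hand to mimic ne_chunk output:

```python
from nltk import Tree

def extract_entities(tree):
    # Collect (entity text, label) pairs from the labeled subtrees
    # that ne_chunk creates for recognized named entities.
    entities = []
    for node in tree:
        if isinstance(node, Tree):
            text = ' '.join(token for token, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# A hand-built tree mimicking what ne_chunk would return:
chunked = Tree('S', [
    Tree('PERSON', [('Mark', 'NNP')]),
    ('works', 'VBZ'), ('at', 'IN'),
    Tree('ORGANIZATION', [('Google', 'NNP')]),
])
print(extract_entities(chunked))
# → [('Mark', 'PERSON'), ('Google', 'ORGANIZATION')]
```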
♫ BAG OF WORDS
Bag of words is a numerical representation of text. It is a pre-processing technique that converts each document into a vector of numbers by counting how many times each vocabulary word occurs in it; word order is discarded, and only the counts are kept.
‘They go to cinema’,
‘The girl doing makeup’,
‘The boy and the girl walked’
{‘they’: 8, ‘go’: 5, ‘to’: 9, ‘cinema’: 2, ‘the’: 7, ‘girl’: 4, ‘doing’: 3, ‘makeup’: 6, ‘boy’: 1, ‘and’: 0, ‘walked’: 10}
The list contains 11 unique words: the vocabulary. That’s why every document is represented by a feature vector of 11 elements. The number of elements is called the dimension.
Then we can express the texts as numeric vectors:
[[0 0 1 0 0 1 0 0 1 1 0]
[0 0 0 1 1 0 1 1 0 0 0]
[1 1 0 0 1 0 0 2 0 0 1]]
We look at each word in the vocabulary in index order, from 0 to 10. For each sentence, we write how many times that word occurs, and 0 if it does not appear. (Notice the 2 in the third vector: ‘the’ occurs twice in ‘The boy and the girl walked’.)
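The counting itself is simple enough to sketch in plain Python. This toy bag_of_words (a hypothetical helper that splits on whitespace instead of using CountVectorizer’s regex tokenizer) reproduces the vocabulary and vectors above:

```python
def bag_of_words(corpus):
    # Lowercase and split each document into tokens.
    docs = [doc.lower().split() for doc in corpus]
    # The vocabulary is the sorted set of unique words; its position
    # in sorted order is each word's index in the feature vector.
    vocab = sorted(set(w for doc in docs for w in doc))
    index = {word: i for i, word in enumerate(vocab)}
    # Each document becomes a vector of per-word counts.
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for w in doc:
            vec[index[w]] += 1
        vectors.append(vec)
    return index, vectors

corpus = ['They go to cinema', 'The girl doing makeup', 'The boy and the girl walked']
index, vectors = bag_of_words(corpus)
print(index)    # {'and': 0, 'boy': 1, 'cinema': 2, ...}
print(vectors)  # [[0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0], ...]
```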
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'They go to cinema',
    'The girl doing makeup',
    'The boy and the girl walked',
]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 0 1 0 0 1 0 0 1 1 0]
 [0 0 0 1 1 0 1 1 0 0 0]
 [1 1 0 0 1 0 0 2 0 0 1]]
{'they': 8, 'go': 5, 'to': 9, 'cinema': 2, 'the': 7, 'girl': 4, 'doing': 3, 'makeup': 6, 'boy': 1, 'and': 0, 'walked': 10}
❄ N — GRAMS
N-grams represent a continuous sequence of N items from a given text. In a broad sense, the items need not be words: they can also be phonemes, syllables or letters, depending on what you want to achieve.
Analysis of a Sentence
pip install -U textblob
python -m textblob.download_corpora

from textblob import TextBlob

sentence = "Children can’t get on rollar coaster without parents."

We’ve created a string containing the sentence we want to analyze, and then passed it to the TextBlob constructor to get the TextBlob instance that we’ll run operations on:

ngram_object = TextBlob(sentence)
ngrams = ngram_object.ngrams(n=2)
print(ngrams)

The ngrams() function returns a list of tuples of n successive words. For our sentence, a bigram model (n=2) gives us the following set of strings:
[WordList(['Children', 'can’t']),
WordList(['can’t', 'get']),
WordList(['get', 'on']),
WordList(['on', 'rollar']),
WordList(['rollar', 'coaster']),
WordList(['coaster', 'without']),
WordList(['without', 'parents'])]
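Under the hood, an n-gram extractor is just a sliding window over the token list. A minimal sketch (make_ngrams is a hypothetical name, operating on an already-tokenized list rather than raw text):

```python
def make_ngrams(tokens, n):
    # Slide a window of size n over the token list and collect each window.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['Children', 'get', 'on', 'roller', 'coaster']
print(make_ngrams(tokens, 2))
# → [('Children', 'get'), ('get', 'on'), ('on', 'roller'), ('roller', 'coaster')]
```

Changing n to 3 gives trigrams, and so on, which is exactly what TextBlob’s ngrams(n=...) parameter controls.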