NATURAL LANGUAGE PROCESSING (NLP)

(PART 1-KEY TERMS)

--

Hello everyone :) This is my first post on Medium, and I would love for you to accompany me on this adventure. If you're ready, ladies and gentlemen, sit down and fasten your seatbelts. Now we dive deep into Natural Language Processing and start exploring.

Well, what exactly is Natural Language Processing (NLP), and why is it so popular? NLP is a field of study at the intersection of artificial intelligence, computer science, and linguistics. Its goal is for machines to understand and use the language people speak. The main areas where NLP is used are summarization, text classification and categorization, sentiment analysis, speech recognition, machine translation, question answering, part-of-speech tagging, named entity recognition, spell checking, and so on.

❤ NLP BASIC CONCEPTS

𝄞 TOKENIZATION

Tokenization is the process of splitting a sentence, a paragraph, or a text document into smaller units, called tokens, such as individual words or terms.

❀ STOP WORDS

These are actually the most common words in any language, and they do not add much information to the text. Some examples of stop words in English are “the”, “a”, “an”, “so”, and “what”.

♕ STEMMING

Stemming is basically removing the suffix from a word to reduce it to its root. For instance:

The words ‘children’ and ‘went’ did not change at all, because stemming only processes the suffixes at the end of a word.
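Since the original code screenshot is missing, here is a sketch using NLTK's `PorterStemmer`, which reproduces the behaviour described above: regular suffixes get chopped off, but irregular forms like ‘children’ and ‘went’ come back unchanged.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "studies", "children", "went"]

# Map each word to its stem; note that stems need not be dictionary words.
stems = {w: ps.stem(w) for w in words}
print(stems)
# {'running': 'run', 'studies': 'studi', 'children': 'children', 'went': 'went'}
```

Notice that ‘studies’ becomes ‘studi’, which is not a real word: stemming is a crude, rule-based cut.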

☀ LEMMATIZATION

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Have you noticed the similarity between stemming and lemmatization? Stemming only cuts off the suffixes at the end of a word, while lemmatization can really get to the dictionary root of the word. If your aim is to find the root, I recommend using lemmatization.

So, why couldn’t we find the roots of the words ‘studying’ and ‘begun’ with lemmatization? Because the lemmatizer treats every word as a noun by default, so these verb forms could not be reduced to their roots unless we tell it the correct part of speech.

ღ PART OF SPEECH TAGGING

It is a fairly simple technique: each sentence is split into its elements, and each word is labeled with its word type (noun, verb, adjective, and so on).

✩ NAMED ENTITY RECOGNITION

Named entity recognition is an NLP technique that automatically identifies named entities in a text and classifies them into predefined categories. Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more. With named entity recognition, you can extract exactly the information you want from a text. Let’s do an example!

If you run the “named_ent” code, you get an error, because NLTK tries to invoke Ghostscript to draw the parse tree. Since we don’t have Ghostscript installed, it raises an error, but at the end of the error it still gives us output like this.

♫ BAG OF WORDS

Bag of Words is a numerical representation of text. It is a pre-processing technique that converts text into a vector of numbers by counting how many times each word in the vocabulary occurs in the document, ignoring grammar and word order.

‘They go to cinema’,

‘The girl doing makeup’,

‘The boy and the girl walked’

{‘they’: 8, ‘go’: 5, ‘to’: 9, ‘cinema’: 2, ‘the’: 7, ‘girl’: 4, ‘doing’: 3, ‘makeup’: 6, ‘boy’: 1, ‘and’: 0, ‘walked’: 10}

The list contains 11 unique words: the vocabulary. That’s why every document is represented by a feature vector of 11 elements. The number of elements is called the dimension.

Then we can express the texts as numeric vectors:

[[0 0 1 0 0 1 0 0 1 1 0]

[0 0 0 1 1 0 1 1 0 0 0]

[1 1 0 0 1 0 0 2 0 0 1]]

We go through each word in the vocabulary in order, from 0 to n. For each sentence, we write that word’s number of occurrences at its index: a word that appears twice (like ‘the’ in the third sentence) gets a 2, a word that appears once gets a 1, and an absent word gets a 0.
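The vocabulary dictionary and vectors above match what scikit-learn's `CountVectorizer` produces, so here is a sketch using it (an assumption on my part, since the original code isn't shown):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["They go to cinema",
        "The girl doing makeup",
        "The boy and the girl walked"]

cv = CountVectorizer()
X = cv.fit_transform(docs)   # sparse document-term matrix

# Each word maps to its column index (assigned alphabetically).
print(cv.vocabulary_)
# {'they': 8, 'go': 5, 'to': 9, 'cinema': 2, 'the': 7, 'girl': 4,
#  'doing': 3, 'makeup': 6, 'boy': 1, 'and': 0, 'walked': 10}

print(X.toarray())
# [[0 0 1 0 0 1 0 0 1 1 0]
#  [0 0 0 1 1 0 1 1 0 0 0]
#  [1 1 0 0 1 0 0 2 0 0 1]]
```

Note the `2` in the third row: ‘the’ occurs twice in “The boy and the girl walked”, and `CountVectorizer` lowercases everything before counting.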

❄ N — GRAMS

N-grams represent a continuous sequence of N items from a given text. In a broad sense, the items need not be words; they can also be phonemes, syllables, or letters, depending on what you want to achieve.

Analysis of a Sentence

We've created a string containing the sentence we want to analyze, and passed it to the TextBlob constructor to get the TextBlob instance that we'll run operations on:

The ngrams() function returns a list of tuples of n successive words. For our sentence, a bigram model (n = 2) gives us the following sets of words:
