NLP 101 — Data Preprocessing & Representation Using NLTK.

An insight into how vital a role data pre-processing and representation play in Natural Language Processing and how to go about it.

Anmol Pant
CodeChef-VIT
10 min read · Aug 2, 2020


NLP, or Natural Language Processing, primarily deals with how machines perceive and convert textual data written in human-readable languages into formats they can perform computations on. Contemporary companies often work with huge amounts of data, and that data comes in a variety of forms and formats, including text documents, emails, tweets, blog posts, spreadsheets, audio recordings, JSONs, online activity logs and more. One of the most common ways such data is recorded is as text, and this text usually mirrors the natural languages we use in our day-to-day conversations, both online and offline.


Natural Language Processing (NLP) is the discipline of programming computers to process, analyze and parse large amounts of this natural textual data in order to build effective and generalized machine learning models. But unlike its numeric counterpart, textual data must first be preprocessed, visualized, represented and moulded into a form that can be handled effectively before it can be used to build models.

This is where Python’s NLTK, or Natural Language Toolkit, module comes into the picture. NLTK is one of the leading platforms for working with human language data. It provides ready-to-use, convenient methods for data handling and pre-processing that are commonly deployed to mould human-readable text into a workable format.

As this article adopts a hands-on approach to using NLTK as a framework, I won’t be delving too deep into all the jargon and terminology associated with it and will address only the terms that are either most commonly used or will appear in the implementation that follows. If you are still curious, do check out this blog by one of our Core Committee members, which explains all the terms one might come across whilst using NLTK.

So, let’s get started.

Tokenization

Tokenization is the process of segmenting a given block of text into sentences or words. Typically, punctuation marks and special characters are stripped out, and the remaining character strings, delimited by whitespace, are treated as tokens (also called terms). The primary advantage tokenization offers is that it gets the text into a format that is easier to convert into raw numbers, which can then be used for actually processing the data.
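
A minimal sketch of tokenization with NLTK (the sample sentence is just an illustration; the punkt resource must be downloaded once):

```python
import nltk

nltk.download('punkt', quiet=True)  # tokenizer models, needed only once

text = "NLTK makes preprocessing simple. It splits text into sentences and words!"

sentences = nltk.sent_tokenize(text)       # sentence-level tokens
words = nltk.word_tokenize(text)           # word-level tokens (punctuation kept as separate tokens)
terms = [w for w in words if w.isalnum()]  # drop punctuation-only tokens, as described above

print(sentences)
print(terms)
```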

Stemming

In layman’s terms, stemming refers to a crude heuristic process that chops off the ends of words in the hope of reducing each word to its base or root form. This approach often removes derivational affixes, so the root words obtained after stemming may not always be valid dictionary words.
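
For instance, here is a small sketch using NLTK’s PorterStemmer on a handful of illustrative words; notice that some of the stems are not valid dictionary words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "flies", "studies", "university", "easily"]:
    # e.g. 'studies' -> 'studi', 'university' -> 'univers' — crude but fast
    print(word, "->", stemmer.stem(word))
```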

Lemmatization

Lemmatization refers to reducing a word to its base or root form using a vocabulary and a morphological analysis of words, instead of blindly chopping off word endings. It normally aims to remove inflectional endings only and returns the base or dictionary form of a word, which is known as the lemma.

Example: if confronted with the token ‘saw’, a crude stemmer might return just the letter ‘s’, whereas lemmatization would attempt to return either ‘see’ or ‘saw’ depending on whether the token was used as a verb or a noun.
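
A short sketch with NLTK’s WordNetLemmatizer illustrates this; the pos argument tells the lemmatizer how the token is being used (the wordnet resource must be downloaded once):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # lexical database backing the lemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("saw", pos="v"))      # used as a verb  -> typically 'see'
print(lemmatizer.lemmatize("saw", pos="n"))      # used as a noun  -> 'saw'
print(lemmatizer.lemmatize("studies", pos="v"))  # -> 'study'
```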

Stopword Removal

Articles, prepositions, and other common words that appear frequently in the text but do not add any meaning or help distinguish documents are called stopwords. The process of detecting and deleting them is termed stopword removal.

Examples: a, an, the, on, in, at, etc.
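
A minimal sketch of stopword removal using NLTK’s built-in English stopword list (the sentence is just an illustration):

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

stop_words = set(stopwords.words('english'))

tokens = nltk.word_tokenize("The cat sat on the mat in a sunny corner of the room")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # e.g. ['cat', 'sat', 'mat', 'sunny', 'corner', 'room']
```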

POS tagging

The process of labelling a word in a text or corpus as corresponding to a particular part of speech, based on both its definition and context.
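
NLTK ships with a ready-made tagger; a quick sketch (the tagger resource must be downloaded once, and resource names can differ slightly across NLTK releases):

```python
import nltk

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)  # default NLTK tagger in older releases

tokens = nltk.word_tokenize("NLTK tags every word with its part of speech")
print(nltk.pos_tag(tokens))
# e.g. [('NLTK', 'NNP'), ('tags', 'VBZ'), ('every', 'DT'), ('word', 'NN'), ...]
```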

Document Representation

The collection of words left in the document after performing all of the above steps differs from the original document and may be considered a formal representation of the document. The purpose of document representation is to aid the process of keyword matching. However, since some words are reduced or removed, it may also result in a loss of information. The three most common conventions of document representation are Complete, Boolean and Bag-of-Words. Let’s look into their nuances one by one.

Complete Representation:

The representation in which the term positions are included along with the frequency is called the ‘complete’ representation. Such a representation preserves most of the information and can be used to generate the original document back from its representation.

Boolean Representation:

The simplest way to use a term as a feature in a document representation is to simply check whether or not the term occurs in the document. Thus, the term is considered to be a Boolean attribute, and this representation is called Boolean representation.

Bag-of-Words Representation:

Document representation that includes just the term frequencies but not the term position, is called a ‘bag-of-words’ representation.
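
Before moving on, here is a tiny, self-contained sketch of what the three representations look like for one toy document (plain Python, no NLTK needed):

```python
from collections import Counter

doc = "to be or not to be".split()
vocabulary = sorted(set(doc))

boolean_rep = {term: 1 for term in vocabulary}                # presence only
bag_of_words = dict(Counter(doc))                             # term frequencies
complete_rep = {term: [i for i, w in enumerate(doc) if w == term]
                for term in vocabulary}                       # frequencies + positions

print(boolean_rep)   # {'be': 1, 'not': 1, 'or': 1, 'to': 1}
print(bag_of_words)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(complete_rep)  # {'be': [1, 5], 'not': [3], 'or': [2], 'to': [0, 4]}
```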

Web Scraping

Web scraping is the technique we will deploy in the upcoming example to extract textual data to work with from different websites.

With all this knowledge about natural language processing and NLTK in our arsenal, we are now ready to see NLTK in action by working on a hands-on mini-project in python3.

Problem Statement

Part 1

Extract the source content from the website (https://en.wikipedia.org/wiki/Natural_language_processing) and display the number of terms and their corresponding term frequency after stopword removal. Also, apply stemming and lemmatization to the same document and display the number of terms along with their corresponding stemmed as well as lemmatized root words. Count the total number of stemmed and lemmatized words hence obtained.

Installing libraries and dependencies

  • requests — an HTTP library for Python, used here to fetch web pages from their URLs.
  • BeautifulSoup — a Python library for extracting data out of HTML documents and web pages.
  • stopwords — a part of nltk.corpus, provides us with a list of stopwords to work with.
  • PorterStemmer — a stemming algorithm built into NLTK.
  • WordNetLemmatizer — a lemmatizing algorithm built into NLTK.
  • pandas — a Python library for data analysis, used here to display results as data frames.
  • pos_tag — the part-of-speech tagger that ships with NLTK, used to tag a given list of tokens.
  • io — Python’s built-in module for working with streams, used here when writing the extracted text to files.
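
A typical setup for the steps that follow might look like this (pip package names and NLTK resource names are the usual ones, but may vary slightly with your environment and NLTK version):

```python
# pip install nltk requests beautifulsoup4 pandas

import io

import requests
import pandas as pd
from bs4 import BeautifulSoup

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag

# One-time downloads of the NLTK resources used below
# (newer NLTK releases may also need 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
```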

Extracting Text

After importing all the required dependencies, we fetch the given URL with the requests library and parse the response with BeautifulSoup(), removing all the script and style tags so that only the text content present on the webpage is retained.
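
A sketch of that extraction step:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Natural_language_processing'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Drop script and style tags so that only the visible text remains
for tag in soup(['script', 'style']):
    tag.decompose()

text = soup.get_text(separator=' ')
print(text[:500])  # preview the extracted content
```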

Stopword Removal

Now we will carry out stopword removal by checking the text obtained above for stopwords and removing the words that are present in both the text and the list of stopwords imported earlier.
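
A sketch of this step; `text` is the string produced by the extraction above (a short stand-in sentence is used here so the snippet runs on its own):

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Stand-in for the page content extracted in the previous step
text = "Natural language processing is a subfield of linguistics and computer science"

stop_words = set(stopwords.words('english'))
tokens = [w for w in nltk.word_tokenize(text) if w.isalnum()]

filtered_words = [w for w in tokens if w.lower() not in stop_words]
removed_words = [w for w in tokens if w.lower() in stop_words]  # kept aside for POS tagging later

print(filtered_words)
```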

Stemming and Lemmatization

We then carry out stemming and lemmatization of the text we get post stopword removal and store the outputs hence obtained in lists stem_words and lemmatize_words respectively.

Printing the output in the form of a data frame.
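
A sketch of these two steps; `filtered_words` stands in for the list produced by stopword removal, so the snippet is self-contained:

```python
import nltk
import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)

# Stand-in for the tokens left after stopword removal
filtered_words = ['natural', 'language', 'processing', 'studies', 'computers']

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stem_words = [stemmer.stem(w) for w in filtered_words]
lemmatize_words = [lemmatizer.lemmatize(w) for w in filtered_words]

df = pd.DataFrame({'Token': filtered_words,
                   'Stemmed': stem_words,
                   'Lemmatized': lemmatize_words})
print(df)
print('Total stemmed words:', len(stem_words), '| unique:', len(set(stem_words)))
print('Total lemmatized words:', len(lemmatize_words), '| unique:', len(set(lemmatize_words)))
```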

Output

Output data frame showing the frequency of the stemmed and lemmatized words.

Part 2

Add one new word to NLTK stopword list and filter the content extracted from the website in order to display the number of terms and their term frequency count after excluding the newly added stopword. Display the POS tag for all the stopwords, which are removed from the content.

Let us add the word ‘language’ to our list of stopwords and then filter the content based on this newly obtained list.

We then create a dictionary containing the words remaining after stopword removal as keys and their corresponding frequency of occurrence as values. By converting this dictionary into two subsequent lists, we can print the required output.

Displaying the output in the form of a data frame.
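
A sketch of Part 2; again, a stand-in sentence replaces the scraped page content so the snippet is runnable on its own:

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Extend NLTK's stopword list with one custom word
stop_words = set(stopwords.words('english'))
stop_words.add('language')

# Stand-in for the content scraped from the Wikipedia page
text = "Natural language processing is a subfield of language technology and computer science"

tokens = [w for w in nltk.word_tokenize(text) if w.isalnum()]
filtered = [w for w in tokens if w.lower() not in stop_words]

# Dictionary of word -> frequency, converted into two lists for the data frame
freq = {}
for word in filtered:
    freq[word] = freq.get(word, 0) + 1

df = pd.DataFrame({'Term': list(freq.keys()),
                   'Frequency': list(freq.values())})
print(df)
```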

Output

Output data frame showing term frequency after adding a new stopword.

Part of Speech or POS tagging

Printing out the POS tag for all the stopwords, which were removed from the text content.
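
A sketch of this step; `removed_words` stands in for the stopwords filtered out earlier:

```python
import nltk

nltk.download('averaged_perceptron_tagger', quiet=True)

# Stand-in for the stopwords that were removed from the page content
removed_words = ['is', 'a', 'of', 'the', 'and', 'language']

print(nltk.pos_tag(removed_words))
# e.g. [('is', 'VBZ'), ('a', 'DT'), ('of', 'IN'), ...]
```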

Output

POS tags of removed stopwords.

Now that we have developed some basic familiarity with NLTK, let us take it up a notch by extracting, analysing and comparing the text content of two different web pages.

Part 3

Extract the contents from two websites (https://en.wikipedia.org/wiki/Natural_language_processing & https://en.wikipedia.org/wiki/Machine_learning) and save the content in two separate documents. Remove stopwords from the content and represent the documents using Boolean, Bag-of-words and Complete representation. Process a search query, compare the contents of both the pages for the existence of the query and display the similarity result based on highest matching count (bag-of-words).

Extracting Text

We first extract the text from the two URLs mentioned above via the very same procedure deployed earlier to remove script and other HTML tags.

Stopword Removal

We then store the text so obtained in two separate doc files and perform stopword removal.
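
A sketch of this step, wrapping the earlier extraction logic in a small helper so it can be applied to both URLs (the file names are arbitrary):

```python
import io
import requests
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

stop_words = set(stopwords.words('english'))

def extract_terms(url, filename):
    """Scrape a page, save its text to a file, and return its stopword-free tokens."""
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()
    text = soup.get_text(separator=' ')
    with io.open(filename, 'w', encoding='utf-8') as f:
        f.write(text)
    tokens = [w.lower() for w in nltk.word_tokenize(text) if w.isalnum()]
    return [w for w in tokens if w not in stop_words]

doc1 = extract_terms('https://en.wikipedia.org/wiki/Natural_language_processing', 'doc1.txt')
doc2 = extract_terms('https://en.wikipedia.org/wiki/Machine_learning', 'doc2.txt')
```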

Boolean Representation

To formulate a boolean representation tabulating the presence of each word in the two documents, we traverse through the list of words obtained after performing stopword removal and check for the occurrence of each word. The output thus obtained is displayed in the form of a data frame.
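
A sketch of the Boolean representation; `doc1` and `doc2` are the stopword-free token lists built above (short stand-ins keep the snippet self-contained):

```python
import pandas as pd

# Stand-ins for the stopword-free token lists of the two documents
doc1 = ['natural', 'language', 'processing', 'computers']
doc2 = ['machine', 'learning', 'computers', 'data']

vocabulary = sorted(set(doc1) | set(doc2))

boolean_df = pd.DataFrame({
    'Term': vocabulary,
    'Document 1': [1 if term in doc1 else 0 for term in vocabulary],
    'Document 2': [1 if term in doc2 else 0 for term in vocabulary],
})
print(boolean_df)
```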

Output

Output data frame showing the occurrence of words in the two documents.

Query Processing

The following code block asks the user to enter a query and looks for the occurrence of that particular keyword in both the documents. It displays ‘1’ if the required word is found in the document and ‘0’ otherwise.
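
A sketch of that query check against the two token lists (stand-ins shown):

```python
# Stand-ins for the stopword-free token lists of the two documents
doc1 = ['natural', 'language', 'processing', 'computers']
doc2 = ['machine', 'learning', 'computers', 'data']

query = input('Enter a search query: ').lower()
print('Document 1:', 1 if query in doc1 else 0)
print('Document 2:', 1 if query in doc2 else 0)
```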

Bag-Of-Words Representation

The code snippet below then counts the number of times each word occurs in both our documents and includes just the term frequencies, neglecting the term position altogether.
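
A sketch of the bag-of-words counts, plus the similarity verdict asked for in the problem statement (whichever document matches the query most often); the token lists are stand-ins:

```python
from collections import Counter
import pandas as pd

# Stand-ins for the stopword-free token lists of the two documents
doc1 = ['natural', 'language', 'processing', 'language', 'computers']
doc2 = ['machine', 'learning', 'machine', 'computers', 'data']

counts1, counts2 = Counter(doc1), Counter(doc2)
vocabulary = sorted(set(counts1) | set(counts2))

bow_df = pd.DataFrame({
    'Term': vocabulary,
    'Document 1': [counts1.get(t, 0) for t in vocabulary],
    'Document 2': [counts2.get(t, 0) for t in vocabulary],
})
print(bow_df)

query = 'computers'  # example query term
best = 'Document 1' if counts1[query] >= counts2[query] else 'Document 2'
print(f"'{query}' has the highest matching count in {best}")
```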

Output

Output data frame showing the frequency of words in both the documents.

Since the position at which a word occurs often holds importance, neglecting how to record it would not do justice to our objective of learning data pre-processing. Hence the following code snippet creates a data frame listing all the words and their respective positions in Document 1.
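
A sketch of that position listing for Document 1 (the token list is a stand-in):

```python
import pandas as pd

# Stand-in for the stopword-free token list of Document 1
doc1 = ['natural', 'language', 'processing', 'language', 'computers']

positions = {}
for index, word in enumerate(doc1):
    positions.setdefault(word, []).append(index)

position_df = pd.DataFrame({
    'Term': list(positions.keys()),
    'Positions': list(positions.values()),
})
print(position_df)
```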

Output

Output data frame showing the position of words in Document 1.

The very same process can then be repeated for the second document as well.

Output

Output data frame showing the position of words in Document 2.

I hope this article drove home some of the basics of data preprocessing and representation in NLP using NLTK. Here is the GitHub link containing the source code for the problems discussed above.

Please drop a clap [or maybe a follow? :)] if you found this informative and stay tuned for more upcoming articles.
