Pre-processing of Topically Coherent Text Segments in Python 💬

How to use Natural Language Toolkit to pre-process a set of transcripts and convert them into numerical representations

Maziar Izadi
Analytics Vidhya
14 min read · Jan 15, 2020


Complete Jupyter Notebook and files are available on my GitHub page.

Introduction

Text documents, such as long recordings and meeting transcripts, are usually comprised of topically coherent text segments, each of which contains some number of text passages. Within each topically coherent segment, one would expect word usage to demonstrate a more consistent lexical distribution than across segments. In Natural Language Processing (NLP), a linear partition of text into topic segments can be used for text analysis tasks such as passage retrieval in information retrieval (IR), document summarization, and discourse analysis. In the current exercise, we will review how to write Python code to preprocess a set of transcripts and convert them into numerical representations suitable for input into topic segmentation algorithms.


This article is derived from one of the assignments I completed as part of my Data Science Graduate Diploma at Monash University. I have also made some changes to make the original tasks more interesting.

What can be the Use Case and how can NLP help?

Nowadays there are many job hunting websites, including seek.com.au and au.indeed.com. These sites all manage a job search system, where job hunters can search for relevant jobs based on keywords, salary, and categories. Very often, the category of an advertised job is manually entered by the advertiser (e.g., the employer), and mistakes are sometimes made in category assignment. As a result, jobs placed in the wrong class do not get enough exposure to the relevant candidate groups.

With advances in text analysis, automated job classification has become feasible, and sensible suggestions for job categories can be made to advertisers. This can help reduce human data entry errors, increase job exposure to relevant candidates, and improve the user experience of the job hunting site. To do so, we need an automated job ads classification system that is trained on an existing job advertisement data set with normalized job categories and predicts the class labels of newly entered job advertisements.

The current example touches on the first step in handling job advertisement text data, i.e., parsing the job advertisement text into a more appropriate format.

The job advertisement data used here contains a significant amount of redundant information represented in a simple txt format. We should properly preprocess the job advertisement text data to improve the performance of classification algorithms.

Problem statement 💡

We are required to write Python code to extract a set of words (e.g., unigrams) that are indicative of the content of each job advertisement, and to convert each advertisement description into a numeric representation: a count vector that can be used directly as input to many classification algorithms.

What are the steps we are going to take?

  • Extract the IDs and descriptions of all the job advertisements in
    the data file data.txt (about 500 job advertisements).
  • Process and store these job advertisement texts as sparse count vectors.

In order to achieve the above-mentioned, we will:

  • Exclude words shorter than 4 characters
  • Remove stop words using the provided stop word list (i.e., stopwords_en.txt)
  • Remove the words that appear only once in one job advertisement
    description and save them (no duplicates) as a txt file (refer to the required output)
  • Exclude those words from the generated vocabulary
  • Find the frequent words that appear in more than 100 advertisement
    descriptions and save them as a txt file (refer to the required output)
  • Exclude them from the generated vocabulary

We will not:

  • Generate multi-word phrases (i.e., collocations, N-grams)

By the end of the exercise, we will have produced the outputs listed below, along with their requirements:

1. vocab.txt: it contains the unigram vocabulary in the following format: word_string:integer_index

  • Words in the vocabulary must be sorted in alphabetical order. This file is the key to interpreting the sparse encoding. For instance, in the following example, the word abbie is the 12th word (the corresponding integer_index = 11) in the vocabulary (note that the numbers and words in the following are not indicative).
vocab.txt file output format

2. highFreq.txt: This file contains the frequent words that appear in more than 100 advertisement descriptions. In the output txt file, each line should contain only one word. The order of the unigrams is based on their frequency, i.e., the number of advertisements containing that word, from high to low.

3. lowFreq.txt: This file contains the words that appear only once in one job advertisement description, in alphabetical order. In the output txt file, each line should contain one word.

4. sparse.txt: Each line of this file corresponds to one advertisement, so each line starts with the advertisement ID. The rest of the line is the sparse representation of the corresponding description in the form word_index:word_freq, separated by commas. The order of the lines must match the order of the advertisements in the input file.

Note: word_freq here refers to the frequency of the unigram in the corresponding description rather than in the whole document. For example, word number 11 (which is ‘abbie’ according to the above example) appears exactly once in the description of advertisement 12612628 (numbers are not indicative):

sparse.txt file output format

Solution ⛳️

So we always begin by importing the required libraries. Owing to the nature of this exercise, the following are required:

Import libraries

  • Regular Expressions

The first one is Regular Expressions, commonly shortened to regex. If you haven’t used them, I strongly suggest you pick them up and get some cool stuff done. Further down, I’ve provided some details to start with.

  • Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

nltk.probability provides classes for representing and processing probabilistic information, such as FreqDist, which we’ll use later.

  • Itertools

The Python itertools module is a collection of tools for handling iterators. Simply put, iterators let you step through a collection of items in a for loop, and the most common iterable in Python is the list.
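Putting the three libraries together, a minimal import cell could look like the sketch below (assuming NLTK is already installed, e.g. via pip install nltk):

```python
import re          # regular expressions for pattern-based tokenisation
import itertools   # iterator building blocks such as chain() and groupby()

from nltk.probability import FreqDist   # frequency distributions over tokens
```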

Let’s write some code 🔥

We import the data first. The file data.txt is on GitHub for your reference. I have it saved on my local computer in the same folder as my Jupyter Notebook file.

Before reading the file, we define an empty list and call it data, for convenience.

Then we simply read data.txt and save it in the list data. Make sure you set the encoding to utf8, otherwise you might get an error.

  • Sample error:

Another consideration is that we directly convert the text to lowercase for consistency, using the .lower() method.
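A minimal sketch of this step, assuming data.txt sits next to the notebook (the exact structure of my original cell may differ slightly):

```python
# define an empty list to hold the raw text, then read data.txt line by line
data = []

# utf8 encoding avoids decode errors; .lower() keeps everything lowercase
with open('data.txt', encoding='utf8') as f:
    for line in f:
        data.append(line.lower())
```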

Formatting and cleansing ✂️ 🔨 📌

Now, we need to begin the process of tokenizing the text. The task of breaking a character sequence into pieces is known as tokenization.

Firstly, we have to remove all the noise, such as /-*#@ or any other non-word characters and extra spaces, from the text, and we do that with the powerful regex tool.

In order to run the formatting using regex, there are two steps you need to take, sketched in the code after this list:

(1) Create the pattern,

(2) Run the pattern using Python code and find the matches.
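As a rough sketch of those two steps (the actual pattern in my notebook is more elaborate, but the idea is the same):

```python
import re

# step (1): compile a pattern that keeps only runs of word characters,
# implicitly dropping noise such as /-*#@ and extra whitespace
pattern = re.compile(r"\w+")

# step (2): run the pattern over the text and collect every match
tokens = pattern.findall(" ".join(data))
```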


There are heaps of online resources on regex, but the one I found most interesting is https://regex101.com/. Not only does it help you match your text against a pattern, but it also provides short and sweet informative content. In Picture 1, I’ve provided a brief list of the functionalities they offer on their page.

Picture 1, Functionalities provided by regex101.com

Useful Regex Resources for Python:

Indexing the tokenized list 📇

Now, I’ve indexed the tokens based on id and title in each job ad:

Next, we create a function from the itertools recipes which iterates through the list of tokens and creates sublists containing the tokens related to one job ad only. The output will be a data dictionary.

To create the data dict, I used Python itertools. Jason Rigdel has provided a good explanation and a set of examples on the topic of itertools in Python.
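The original cell is embedded as a gist, but the idea can be sketched as follows; the split_ads() helper and the 'id' delimiter are my assumptions about how the token stream is laid out, not the exact recipe used in the notebook:

```python
from itertools import groupby

def split_ads(tokens, delimiter='id'):
    """Yield one sub-list of tokens per job ad by splitting on a delimiter token.
    (Hypothetical helper: the delimiter depends on how data.txt is laid out.)"""
    for is_delim, group in groupby(tokens, key=lambda tok: tok == delimiter):
        if not is_delim:
            yield list(group)

# build a dictionary mapping each advertisement id to its remaining tokens,
# assuming the first token of every sub-list is the ad's id
token_dict = {chunk[0]: chunk[1:] for chunk in split_ads(tokens)}
```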

However, the list contains a lot of functional words, such as “to”, “in”, “the”, “is” and so on.

These functional words usually do not contribute much to the semantics of the text, except for increasing the dimensionality of the data in text analysis.

Also, note that our goal, usually, is to build a predictive classification model. Thus, we are more interested in the meaning of the text than in its syntax. Therefore, we can choose to remove those words, which is our next task.

I will exclude all tokens shorter than 4 characters by keeping only those with more than 3 characters and appending the rest to a list I named to_remove. This list will be added to a generic English stop word list later.
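A sketch of that filtering step, reusing the token_dict built above:

```python
# keep only tokens with more than 3 characters; everything shorter goes to to_remove
to_remove = []

for ad_id, words in token_dict.items():
    to_remove.extend(w for w in words if len(w) < 4)
    token_dict[ad_id] = [w for w in words if len(w) >= 4]
```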

Removing StopWords ✂️

Stop words carry little lexical content.

They are often functional words in English, for example, articles, pronouns, particles, and so on. In NLP and IR, we usually exclude stop words from the vocabulary. Otherwise, we will face the curse of dimensionality.

There are some exceptions, such as syntactic analysis tasks like parsing, where we choose to keep those functional words. Here, however, we are going to remove all the stop words using a stop word list, which is:

For this example, I have already provided the stopwords_en.txt file on my GitHub, where you can download it. We first add the to_remove list that we created above to the contents of stopwords_en.txt, read the file, and then save them as a set().

You might be wondering why we saved the stop words as a set. That’s a good question… A Python set is a better choice than a list here because membership lookups on a set are much faster than on a list when searching through a large number of hashable items.

Next, I’ve created a function called purifier(), which essentially purifies the tokens by removing stop words, and then run the tokenised list through it.
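A hedged sketch of this step is below; the exact signature of purifier() in my notebook may differ, but the idea is to filter every ad's tokens against one combined set of unwanted words:

```python
# read the provided stop word list and merge in the short tokens collected earlier
with open('stopwords_en.txt', encoding='utf8') as f:
    stopwords = set(f.read().split())

stopwords.update(to_remove)   # a set gives fast membership checks

def purifier(token_dict, unwanted):
    """Return a new dict with every token in `unwanted` removed from each ad."""
    return {ad_id: [w for w in words if w not in unwanted]
            for ad_id, words in token_dict.items()}

token_dict = purifier(token_dict, stopwords)
```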


The next step is to remove the words that appear only once in one job advertisement description and save them (no duplicates) as a txt file (refer to the required output). In order to do that, we will need to exclude those words from the generated vocabulary.

To do that, we begin by using the chain() function to join all the words in all the job ads together into one list. In “A Guide to Python Itertools”, there’s a good explanation of how the chain() function works.

We then convert the list of words into a set to remove duplicates and create the vocabulary set.
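A minimal sketch of those two lines:

```python
from itertools import chain

# flatten every ad's tokens into one list, then deduplicate into the vocabulary set
words = list(chain.from_iterable(token_dict.values()))
vocab = set(words)
```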

Next, we pass the words to the FreqDist() function to count the tokens.

The FreqDist class is used to encode “frequency distributions”, which count the number of times that each outcome of an experiment occurs. It is one of the classes in the nltk.probability module.

According to developedia, typically, each text corpus is a collection of text sources. There are dozens of such corpora for a variety of NLP tasks; this article ignores speech corpora and considers only those in text form. In our example, the text corpus refers to all the job ads combined (… and not each job ad separately).

The following function counts the number of times a word occurs in the whole corpus, regardless of which ad it appears in.
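In its simplest form, that is just a FreqDist over the flattened word list:

```python
from nltk.probability import FreqDist

# corpus-wide term frequency: how many times each word occurs across all ads
term_freq = FreqDist(words)
```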

Low-Frequency Tokens

To find the low-frequency tokens, I create a list of the tokens that have occurred only once and convert the list into a set for a performance improvement.

To create the lowFreq.txt file, I’ve saved the sorted set of the words that appear only once in one job advertisement description into a file with that name.
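A sketch of both steps, reusing the term_freq counts from above (the variable name once_only matches the one referred to later in the article):

```python
# words whose corpus-wide count is exactly one
once_only = {w for w, count in term_freq.items() if count == 1}

# lowFreq.txt: one word per line, in alphabetical order
with open('lowFreq.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(sorted(once_only)))
```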

High-Frequency Tokens

At this stage, I repeat the same steps as above; however, this time the intention is to find the high-frequency words and save them in a file named highFreq.txt.
I start by removing the lowFreq tokens from the list of tokens by running the purifier() function we defined earlier.

Next, we create a new list of words after the once_only words are removed.

For the high-frequency words, I have selected a threshold of 100. You can definitely choose a different threshold according to the context of your work.

Now, we save the sorted list of high-frequency words that appear in more than 100 job advertisement descriptions to a file.

Once again, we run the purifier() function, this time to remove the highFreq words from the data set and create a new list.
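Putting the whole high-frequency pass together, a hedged sketch could look like the following; it reuses the purifier() helper from earlier and counts document frequency (the number of ads containing a word) with FreqDist, so the exact notebook code may differ:

```python
from itertools import chain
from nltk.probability import FreqDist

# drop the once-only words, then rebuild the flat word list
token_dict = purifier(token_dict, once_only)
words = list(chain.from_iterable(token_dict.values()))

# document frequency: in how many ad descriptions does each word appear?
doc_freq = FreqDist(chain.from_iterable(set(ws) for ws in token_dict.values()))

# words appearing in more than 100 descriptions, ordered from high to low frequency
high_freq = [w for w, df in doc_freq.most_common() if df > 100]

with open('highFreq.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(high_freq))

# finally remove the high-frequency words from the token lists as well
token_dict = purifier(token_dict, set(high_freq))
```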

Note

You might be wondering what the difference is between the words and vocab lists in my code, and why every time I created a words list, a vocab was also created. The reason goes back to the difference between a list and a set in Python. The bottom line is that in vocab each word is listed only once, while words might contain duplicates.

Next is just a simple checkpoint to see the progress of purification of tokens:

…which gives us the following output:

Next, we create a file of the full vocabulary, named vocab.txt.

We build a function to create the vocab.txt file, and finally build and sort the file by calling the functions:
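A sketch of such a function is below; write_vocab is a hypothetical name, and I also return a word-to-index mapping here because it comes in handy for the sparse encoding in the next step:

```python
from itertools import chain

def write_vocab(token_dict, path='vocab.txt'):
    """Write the vocabulary as word_string:integer_index, sorted alphabetically,
    and return the word -> index mapping for later use."""
    vocab = sorted(set(chain.from_iterable(token_dict.values())))
    with open(path, 'w', encoding='utf8') as f:
        for index, word in enumerate(vocab):
            f.write(f'{word}:{index}\n')
    return {word: index for index, word in enumerate(vocab)}

word_index = write_vocab(token_dict)
```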

vocab.txt output:

Until this point, I tried to keep the code as simple as possible for practice purposes. However, for this step, I created a slightly more complex, yet more efficient, piece of code. Please leave any questions in the comments and I will make sure they are answered.

The final activity is to create the sparse representation of each description in the form word_index:word_freq, separated by commas, and write it to the file sparse.txt.

And finally build the sparse.txt file.
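A hedged sketch of that final step, reusing the hypothetical word_index mapping from the previous block (my actual notebook cell is more compact, but the output format is the same):

```python
from nltk.probability import FreqDist

# one line per advertisement: the ad id, then word_index:word_freq pairs separated by commas
# (dict insertion order matches the order of ads in the input file on Python 3.7+)
with open('sparse.txt', 'w', encoding='utf8') as f:
    for ad_id, words in token_dict.items():
        counts = FreqDist(words)   # frequency of each word within this description only
        pairs = ','.join(f'{word_index[w]}:{freq}'
                         for w, freq in sorted(counts.items(),
                                               key=lambda kv: word_index[kv[0]]))
        f.write(f'{ad_id},{pairs}\n')
```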

Output of sparse.txt :

— end —
