Pre-processing of Topically Coherent Text Segments in Python 💬
How to use the Natural Language Toolkit to pre-process a set of transcripts and convert them into numerical representations
The complete Jupyter Notebook and files are available on my GitHub page.
Introduction
Text documents, such as long recordings and meeting transcripts, are usually composed of topically coherent text segments, each of which contains some number of text passages. Within each topically coherent segment, one would expect word usage to show a more consistent lexical distribution than across segments. Natural Language Processing (NLP), and more specifically a linear partition of texts into topic segments, can be used for text analysis tasks such as passage retrieval in information retrieval (IR), document summarization, and discourse analysis. In this exercise, we will review how to write Python code to preprocess a set of transcripts and convert them into numerical representations suitable as input to topic segmentation algorithms.
This article is derived from one of the assignments I completed as part of my Data Science Graduate Diploma at Monash University. I have also made some changes to make the original tasks more interesting.
What can the use case be, and how can NLP help?
Nowadays there are many job hunting websites, including seek.com.au and au.indeed.com. These sites all run a job search system where job hunters can search for relevant jobs by keywords, salary, and category. Very often, the category of an advertised job is entered manually by the advertiser (e.g., the employer), so mistakes in category assignment do happen. As a result, jobs filed under the wrong class will not get enough exposure to the relevant candidate groups.
With advances in text analysis, automated job classification has become feasible, and sensible job category suggestions can be made to potential advertisers. This can help reduce human data entry errors, increase job exposure to relevant candidates, and also improve the user experience of the job hunting site. In order to do so, we need an automated job ads classification system that is trained on an existing job advertisement data set with normalized job categories and predicts the class labels of newly entered job advertisements.
The current example covers the first step in handling job advertisement text data, i.e., parsing the job advertisement text into a more appropriate format.
The job advertisement data used here contains a significant amount of redundant information represented in a simple txt format. We should properly preprocess the job advertisement text data to improve the performance of the classification algorithms.
Problem statement 💡
We are required to write Python code that extracts a set of words (e.g., unigrams) indicative of the content of each job advertisement, and converts each advertisement description into a numeric representation: a count vector that can be used directly as input to many classification algorithms.
What are the steps we are going to take?
- Extract the IDs and descriptions of all the job advertisements in the data file data.txt (about 500 job advertisements).
- Process and store these job advertisement texts as sparse count vectors.
In order to achieve the above-mentioned, we will:
- Exclude words with fewer than 4 characters
- Remove stopwords using the provided stop words list (i.e., stopwords_en.txt)
- Find the words that appear only once in one job advertisement description, save them (no duplication) as a txt file (refer to the required output), and exclude those words from the generated vocabulary
- Find the frequent words that appear in more than 100 advertisement descriptions, save them as a txt file (refer to the required output), and exclude them from the generated vocabulary
We will not:
- Generate multi-word phrases (i.e., collocations, n-grams)
By the end of the exercise, we will have the several outputs listed below, along with their requirements; an illustrative sketch of the formats follows the list.
1. vocab.txt: it contains the unigram vocabulary in the format word_string:integer_index. Words in the vocabulary must be sorted in alphabetical order. This file is the key to interpreting the sparse encoding. For instance, the word abbie might be the 12th word in the vocabulary, so its corresponding integer_index = 11 (the numbers and words in these examples are not indicative of the real data).
2. highFreq.txt: this file contains the frequent words that appear in more than 100 advertisement descriptions. In the output txt file, each line should contain only one word. The order of the unigrams is based on their frequency, i.e., the number of advertisements containing that word, from high to low.
3. lowFreq.txt: this file contains the words that appear only once in one job advertisement description, in alphabetical order. In the output txt file, each line should contain one word.
4. sparse.txt: each line of this file corresponds to one advertisement, so each line starts with the advertisement ID. The rest of the line is the sparse representation of the corresponding description in the form word_index:word_freq, separated by commas. The order of the lines must match the order of the advertisements in the input file. Note: word_freq here refers to the frequency of the unigram in the corresponding description rather than in the whole document. For example, word number 11 (which is 'abbie' in the example above) might appear exactly once in the description of advertisement 12612628 (the numbers are not indicative).
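Before diving in, here is a minimal sketch of how these formats fit together, using a made-up three-word vocabulary and a fake advertisement ID (none of the words, indexes, or counts below come from the actual data):

# purely illustrative: a tiny three-word vocabulary and one fake advertisement ID
toy_vocab = {"abbie": 0, "broker": 1, "salary": 2}   # vocab.txt lines: "abbie:0", "broker:1", "salary:2"
toy_tokens = ["salary", "abbie", "salary"]           # tokens of one made-up ad description

# sparse.txt line format: "#<ad_id>,<word_index>:<word_freq>,..."
counts = {}
for token in toy_tokens:
    idx = toy_vocab[token]
    counts[idx] = counts.get(idx, 0) + 1
line = "#00000000," + ",".join(f"{i}:{c}" for i, c in counts.items())
print(line)  # -> #00000000,2:2,0:1

In other words, vocab.txt tells you which word each integer index stands for, and each sparse.txt line compresses one description into index:count pairs.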
Solution ⛳️
So we always begin by importing the required libraries. Owing to the nature of this exercise, the following are required:
Import libraries
- Regular Expressions
The first is Regular Expressions, called RegEx for short. If you haven't used them, I strongly suggest you pick them up and get some cool stuff done. Further down, I've provided some details to start with.
# Regular Expressions (RegEx)
import re
- Natural Language Toolkit
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
nltk.probability provides classes for representing and processing probabilistic information, such as FreqDist, which we'll use later.
# Natural Language Toolkit
import nltk
from nltk.probability import *
from nltk.corpus import stopwords
- Itertools
The Python itertools module is a collection of tools for handling iterators. Simply put, iterators are data types that can be used in a for loop. The most common iterable in Python is the list.
# Functions creating iterators for efficient looping
import itertools
from itertools import chain
from itertools import groupby
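If these helpers are new to you, here is a quick toy demonstration of the two we will lean on most (the inputs are made up and unrelated to the job data):

# chain flattens nested iterables into one stream
print(list(chain.from_iterable([[1, 2], [3], [4, 5]])))
# [1, 2, 3, 4, 5]

# groupby groups consecutive equal items
print([(key, len(list(group))) for key, group in groupby("aaabbc")])
# [('a', 3), ('b', 2), ('c', 1)]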
Let’s write some code 🔥
We import the data first. The file named data.txt is on GitHub for your reference. I have it saved on my local computer in the same folder as my Jupyter Notebook file.
Before reading the file, we define an empty list and call it data, for convenience (after reading, it will actually hold the file contents as a single string).
data = []
Then we simply read data.txt and save it in data. Make sure you define the encoding format utf8; otherwise you might get an error.
- Sample error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 260893: character maps to <undefined>
Another consideration is that we directly convert the text to lowercase for consistency, using the .lower() method.
with open('data.txt', encoding="utf8") as f:
data = f.read().lower()
Formatting and cleansing ✂️ 🔨 📌
Now, we need to begin the process of tokenizing the text. The task of breaking a character sequence into pieces is known as tokenization.
First, we have to remove all the noise, such as /-*#@ or any other non-word characters or extra spaces, from the text, and we do that with the powerful RegEx tool.
In order to run the formatting using RegEx, there are two steps you need to take: (1) create the pattern, and (2) run the pattern using Python code to find the matches.
# (1) create a pattern for RegEx to find and keep matching words only
pattern = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

# (2) tokenise the words: match the pattern against the file's content
tokenised = pattern.findall(data)
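To see what this pattern actually keeps and drops, here is a quick run on a made-up sample sentence (not from data.txt): it keeps alphabetic words, allows one internal hyphen or apostrophe, and discards digits and punctuation.

# a made-up sample sentence, just to see what the pattern keeps
sample = "full-time role! salary: $55k / apply at jobs@example.com"
print(pattern.findall(sample))
# ['full-time', 'role', 'salary', 'k', 'apply', 'at', 'jobs', 'example', 'com']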
There are heaps of online resources on regex, but the one I found most interesting is https://regex101.com/. Not only does it help you match your text against a pattern, it also provides short and sweet informative content. In Picture 1, I've provided a brief list of the functionalities they provide on their page.
Useful Regex Resources for Python:
- Python Regex Tutorial for Data Science
- Python 3 re module documentation
- Online regex tester and debugger
Indexing the tokenized list 📇
Now, I've indexed the tokens based on the positions of id and title in each job ad:
# pass the length of the 'tokenised' series into a variable
tokenised_len = len(tokenised)

# index the tokens based on the positions of "id" and "title"
indexes = [i for i, v in enumerate(tokenised) if v == 'id' and i+1 < tokenised_len and tokenised[i+1] == 'title']
Next, we create a function from the itertools recipes, which iterates through the list of tokens and creates sub-lists so that each includes the tokens of one job ad only. From these, we will later build a data dictionary.
# from itertools recipes
def pairwise(iterable, fillvalue=None):
"""
This function iterates through the list of tokens and
    creates sub-lists that include tokens related to one job ad only
"""
a, b = iter(iterable), iter(iterable)
next(b, None)
return itertools.zip_longest(a, b, fillvalue=fillvalue)
# pairwise based on the indexes from the last block; store the sub-lists back in 'tokenised'
tokenised = [tokenised[i:j] for i, j in pairwise(indexes)]
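For intuition, here is what pairwise() produces on a made-up index list; the trailing None in the final pair makes the last slice, tokenised[112:None], run to the end of the token list:

# toy index list, just for intuition (the numbers are made up)
print(list(pairwise([0, 57, 112])))
# [(0, 57), (57, 112), (112, None)]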
To create the data dict, I used Python itertools. Jason Rigdel has written a good explanation with a set of examples on the topic of itertools in Python.
However, the list contains a lot of functional words, such as "to", "in", "the", "is", and so on. These functional words usually do not contribute much to the semantics of the text, beyond increasing the dimensionality of the data in text analysis. Also, note that our goal, usually, is to build a predictive classification model; thus, we are more interested in the meaning of the text than in its syntax. Therefore, we can choose to remove those words, which is our next task.
I will exclude all tokens shorter than 4 characters by replacing each of them with a placeholder token named to_remove. This placeholder will be added to a generic English stopwords list later, so the short tokens get dropped along with the stop words.
tokenised = [[word if len(word) > 3 else "to_remove" for word in job] for job in tokenised]
Removing StopWords ✂️
Stop words carry little lexical content. They are often functional words in English, for example articles, pronouns, particles, and so on. In NLP and IR, we usually exclude stop words from the vocabulary; otherwise, we will face the curse of dimensionality. There are some exceptions: in syntactic analysis tasks such as parsing, we choose to keep those functional words. Here, however, we are going to remove all the stop words. NLTK's own stop word list can be loaded like this:
nltk.download('stopwords')
stopwords_list = stopwords.words('english')
For this example, I have already provided the stopwords_en.txt file on my GitHub, where you can download it. We first append the to_remove placeholder to the stopwords_en.txt file, read the file back, and then save the words as a set().
# adding the 'to_remove' string to the list of stopwords
stopwords = []
with open('stopwords_en.txt', "a") as f:
    f.write("\nto_remove")  # \n to shift to the next line
with open('stopwords_en.txt') as f:
    stopwords = f.read().splitlines()  # read the stopwords line by line into a list

# convert the stopwords list into a set
stopwordsset = set(stopwords)
You might be wondering why we saved stopwords as a set. That's a good question: a Python set is a better choice than a list here because membership lookups on a set run much faster than on a list when searching a large number of hashable items.
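A minimal sketch of the difference, if you want to convince yourself (exact timings will vary by machine):

import timeit

haystack_list = list(range(100_000))
haystack_set = set(haystack_list)

# membership test: a list scans element by element, a set does one hash lookup
print(timeit.timeit(lambda: 99_999 in haystack_list, number=1_000))  # slow: O(n) per lookup
print(timeit.timeit(lambda: 99_999 in haystack_set, number=1_000))   # fast: O(1) per lookup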
Next, I've created a function called purifier(), which essentially purifies the tokens by removing stopwords, and then run the tokenised list through it.
def purifier(tokenList,remove_token):
"""
    This function takes two inputs (the list of current tokens
    and the list of tokens to be removed).
    It converts each job's tokens into a set to improve
    performance, and returns a list of sets, each of which
    contains the purified tokens with the remove_token items
    taken out
"""
    return [set(word for word in job if word not in remove_token) for job in tokenList]

# running the 'purifier' function
tokenised = purifier(tokenised, stopwordsset)
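A toy call (with made-up tokens) shows the behaviour, including a side effect worth noting: because purifier() builds a set per job ad, duplicate tokens within a single ad are collapsed as well.

toy_jobs = [['data', 'the', 'data', 'science'], ['senior', 'and', 'analyst']]
print(purifier(toy_jobs, {'the', 'and'}))
# [{'data', 'science'}, {'senior', 'analyst'}]  (set element order may differ)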
Next, we remove the words that appear only once in one job advertisement description and save them (no duplication) as a txt file (refer to the required output). We also need to exclude those words from the generated vocabulary.
To do that, we begin by using the chain() function to join all the words from all the job ads together into one list. "A Guide to Python Itertools" has a good explanation of how the chain() function works.
stop_wrds_removed_words = list(chain.from_iterable([word for word in job] for job in tokenised))
We then convert the list of words into a set to remove duplicates and create the vocabulary set:
stop_wrds_removed_vocab = set(stop_wrds_removed_words)
Next, we pass the words into the FreqDist() function to count the tokens.
The FreqDist class is used to encode "frequency distributions", which count the number of times that each outcome of an experiment occurs. It is one of the classes in the nltk.probability module. According to Devopedia, each text corpus is typically a collection of text sources, and there are dozens of such corpora for a variety of NLP tasks (here we ignore speech corpora and consider only those in text form). In our example, the text corpus refers to the combination of all job ads together (and not each job separately). The following line counts the number of times a word occurs in the whole corpus, regardless of which ad it is in.
fd = FreqDist(stop_wrds_removed_words)
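On a toy list of made-up words, FreqDist behaves much like a counter:

toy_fd = FreqDist(['data', 'science', 'data', 'jobs'])
print(toy_fd['data'])         # 2
print(toy_fd.most_common(2))  # [('data', 2), ('science', 1)]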
Low-Frequency Tokens
To find the low-frequency tokens, I create a list of the tokens which have occurred only once, and convert the list into a set for a performance improvement.
once_only = set([k for k, v in fd.items() if v == 1])

# sort the set into alphabetical order (sorted() returns a list)
once_only = sorted(once_only)
To create the lowFreq.txt file, I've saved the sorted words that appear "once only" in one job advertisement description into a file of the same name.
out_file = open("lowFreq.txt", 'w')
for d in once_only:
out_file.write(''.join(d) + '\n')
out_file.close()
High-Frequency Tokens
At this stage, I repeat the same steps as above; this time, however, the intention is to find the high-frequency words and save them in a file named highFreq.txt.
I start by removing the lowFreq tokens from the list of tokens by running the purifier() function that we defined earlier.
tokenised = purifier(tokenised,once_only)
Next, we create a new list of words after the once_only words have been removed.
LowFreqRemoved_Words = list(chain.from_iterable([word for word in job] for job in tokenised))
LowFreqRemoved_vocab = set(LowFreqRemoved_Words)
LowFreqRemoved_fd = FreqDist(LowFreqRemoved_Words)
For the high-frequency words, I have selected a threshold of 100. You can definitely choose a different threshold according to the context of your work.
highFreq = set([k for k, v in LowFreqRemoved_fd.items() if v > 100])
Now, we save the high-frequency words that appear in more than 100 job advertisement descriptions to a file, ordered by frequency from high to low, as the required output specifies.
out_file = open("highFreq.txt", 'w')
for d in highFreq:
out_file.write(''.join(d) + '\n')
out_file.close()
Once again, we run the purifier() function to remove the highFreq set and create a new list.
tokenised = purifier(tokenised,highFreq)
HighFreqRemoved_words = list(chain.from_iterable([word for word in job] for job in tokenised))
HighFreqRemoved_vocab = set(HighFreqRemoved_words)
Note
You might be wondering what the difference is between the words and vocab lists in my code, and why every time I created a words list, a vocab was also created. The reason goes back to the difference between a list and a set in Python. The bottom line is that in vocab each word is listed only once, while words might contain duplicates.
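In miniature (made-up words again):

toy_words = ['data', 'science', 'data']   # a words list keeps duplicates
toy_vocab = set(toy_words)                # a vocab set lists each word once
print(len(toy_words), len(toy_vocab))     # 3 2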
Next is just a simple checkpoint to see the progress of the token purification:
print(f"Length of words: {len(stop_wrds_removed_words)}")print(f"Length of vocab: {len(stop_wrds_removed_vocab)}")print(f"Length of LowFreqRemoved_Words: {len(LowFreqRemoved_Words)}")print(f"Length of LowFreqRemoved_vocab: {len(LowFreqRemoved_vocab)}")print(f"Length of HighFreqRemoved_words: {len(HighFreqRemoved_words)}")print(f"Length of HighFreqRemoved_vocab: {len(HighFreqRemoved_vocab)}")
…which gives us the following output:
Length of words: 474345
Length of vocab: 18619
Length of LowFreqRemoved_Words: 465779
Length of LowFreqRemoved_vocab: 10053
Length of HighFreqRemoved_words: 126491
Length of HighFreqRemoved_vocab: 9103
Next, we create a file of all the vocab terms, named vocab.txt.
# sort the final vocabulary alphabetically so the integer indexes follow alphabetical order
HighFreqRemoved_vocab = sorted(HighFreqRemoved_vocab)

# dictionary of final vocab terms mapped to their integer indexes
vocab = {HighFreqRemoved_vocab[i]: i for i in range(0, len(HighFreqRemoved_vocab))}
We build a function to create the vocab.txt file, and finally build the sorted file by calling the function:
def vocab_output(file):
    with open(file, "w") as f:  # 'w' so re-running the notebook does not append duplicates
        for key in sorted(vocab.keys()):
            f.write("%s:%s\n" % (key, vocab[key]))

# calling the function to build the file
vocab_output("vocab.txt")
vocab.txt output:
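The screenshot from the original post is not reproduced here, but with the alphabetical indexing above, the first lines of vocab.txt would look something like this (the words below are illustrative, not the actual output):

aaron:0
abandon:1
ability:2
...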
Until this point, I tried to keep the code as simple as possible for practice purposes. For this step, however, I created a slightly more complex, yet more efficient, piece of code. Please leave any questions in the comments and I will make sure they are answered.
The final activity is to build the sparse representation of each description in the form word_index:word_freq, separated by commas, and create the file sparse.txt.
data = {}
id = None
with open('data.txt', 'r', encoding="utf8") as f:
    for i, line in enumerate(f):  # iterate over the imported file line by line
        line = line.lower()
        line = line.strip()
        if not line:
            continue
        section = line.split(':')[0]  # 'section' tells us how the line begins (id / title / description)
        content = ':'.join(line.split(':')[1:]).strip()  # 'content' captures everything after the section name
        if section == 'id':  # id section:
            if id:  # error handling in case of bad formatting: multiple ids
                raise ValueError('unable to parse file at line %d, multiple ids' % i)
            id = content[1:]  # capture the job id (dropping the leading character)
            if id in data.keys():  # error handling in case of bad formatting: duplicate ids
                raise ValueError('unable to parse file at line %d, duplicate id' % i)
        elif section == 'description':  # capture the job description for each job ad
            if not id:  # error handling in case of bad formatting: missing id
                raise ValueError('unable to parse file at line %d, missing id' % i)
            content = pattern.findall(line)
            content = [value for value in content if len(value) > 3]  # remove short-character tokens
            content = [value for value in content if value not in stopwordsset]  # remove stopwords
            content = [value for value in content if value not in once_only]  # remove lowFreq tokens
            content = [value for value in content if value not in highFreq]  # remove highFreq tokens
            data[id] = content  # add to the data dictionary
            id = None
        elif section == 'title':  # if the line starts with 'title', do nothing
            continue
        else:
            raise ValueError('unable to parse file at line %d, unexpected section name' % i)
And finally, we build the sparse.txt file.
with open('sparse.txt', "w") as f:
    for jobID, content in data.items():  # go through the data dictionary created in the last block
        fd_parse = FreqDist(content)  # count the number of times each token occurred within the same job ad
        tmp = ""  # create a placeholder string for the word_index:word_freq pairs
        for (x, y) in fd_parse.items():  # iterate through the frequencies
            tmp += f"{vocab[x]}:{y},"  # append each word_index:word_freq pair to the placeholder
        f.write(f"#{jobID},{tmp[:-1]}\n")  # write to the file line by line (dropping the trailing comma)
Output of sparse.txt:
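Again, the original screenshot is omitted; each line of sparse.txt would look something like the following, with the ad ID first and then the comma-separated word_index:word_freq pairs (the numbers are illustrative, not actual output):

#12612628,11:1,874:2,2047:1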
— end —