Pre-processing of Topically Coherent Text Segments in Python 💬
How to use the Natural Language Toolkit to pre-process a set of transcripts and convert them into numerical representations
The complete Jupyter Notebook and files are available on my GitHub page.
Introduction
Text documents, such as long recordings and meeting transcripts, are usually composed of topically coherent text segments, each of which contains some number of text passages. Within each topically coherent segment, one would expect word usage to show a more consistent lexical distribution than across segments. Natural Language Processing (NLP), and more specifically a linear partition of texts into topic segments, can be used for text analysis tasks such as passage retrieval in information retrieval (IR), document summarization, and discourse analysis. In this exercise, we will review how to write Python code to preprocess a set of transcripts and convert them into numerical representations suitable as input to topic segmentation algorithms.
This article is derived from one of the assignments I completed as part of my Data Science Graduate Diploma at Monash University. I have also made some changes to make the original tasks more interesting.
What can the use case be, and how can NLP help?
Nowadays there are many job hunting websites, including seek.com.au and au.indeed.com. These sites all run a job search system where job hunters can search for relevant jobs by keywords, salary, and category. Very often, the category of an advertised job is entered manually by the advertiser (e.g., the employer), so mistakes in category assignment do happen. As a result, jobs filed under the wrong class will not get enough exposure to the relevant candidate groups.
With advances in text analysis, automated job classification has become feasible, and sensible job category suggestions can be made to potential advertisers. This can help reduce human data entry errors, increase job exposure to relevant candidates, and also improve the user experience of the job hunting site. In order to do so, we need an automated job ads classification system that is trained on an existing job advertisement data set with normalized job categories and predicts the class labels of newly entered job advertisements.
The current example covers the first step in handling job advertisement text data, i.e., parsing the job advertisement text into a more appropriate format.
The job advertisement data used here contains a significant amount of redundant information represented in a simple txt format. We should properly preprocess the job advertisement text data to improve the performance of the classification algorithms.
Problem statement 💡
We are required to write Python code that extracts a set of words (e.g., unigrams) indicative of the content of each job advertisement, and converts each advertisement description into a numeric representation: a count vector that can be used directly as input to many classification algorithms.
What are the steps we are going to take?
- Extract the IDs and descriptions of all the job advertisements in the data file data.txt (about 500 job advertisements).
- Process and store these job advertisement texts as sparse count vectors.
In order to achieve the above-mentioned, we will:
- Exclude words with fewer than 4 characters
- Remove stopwords using the provided stop words list (i.e., stopwords_en.txt)
- Find the words that appear only once in one job advertisement description, save them (no duplication) as a txt file (refer to the required output), and exclude those words from the generated vocabulary
- Find the frequent words that appear in more than 100 advertisement descriptions, save them as a txt file (refer to the required output), and exclude them from the generated vocabulary
We will not:
- Generate multi-word phrases (i.e., collocations, n-grams)
By the end of the exercise, we will have the several outputs listed below, along with their requirements; an illustrative sketch of the formats follows the list.
1. vocab.txt: it contains the unigram vocabulary in the format word_string:integer_index. Words in the vocabulary must be sorted in alphabetical order. This file is the key to interpreting the sparse encoding. For instance, the word abbie might be the 12th word in the vocabulary, so its corresponding integer_index = 11 (the numbers and words in these examples are not indicative of the real data).
2. highFreq.txt: this file contains the frequent words that appear in more than 100 advertisement descriptions. In the output txt file, each line should contain only one word. The order of the unigrams is based on their frequency, i.e., the number of advertisements containing that word, from high to low.
3. lowFreq.txt: this file contains the words that appear only once in one job advertisement description, in alphabetical order. In the output txt file, each line should contain one word.
4. sparse.txt: each line of this file corresponds to one advertisement, so each line starts with the advertisement ID. The rest of the line is the sparse representation of the corresponding description in the form word_index:word_freq, separated by commas. The order of the lines must match the order of the advertisements in the input file. Note: word_freq here refers to the frequency of the unigram in the corresponding description rather than in the whole document. For example, word number 11 (which is 'abbie' in the example above) might appear exactly once in the description of advertisement 12612628 (the numbers are not indicative).
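Before diving in, here is a minimal sketch of how these formats fit together, using a made-up three-word vocabulary and a fake advertisement ID (none of the words, indexes, or counts below come from the actual data):

# purely illustrative: a tiny three-word vocabulary and one fake advertisement ID
toy_vocab = {"abbie": 0, "broker": 1, "salary": 2}   # vocab.txt lines: "abbie:0", "broker:1", "salary:2"
toy_tokens = ["salary", "abbie", "salary"]           # tokens of one made-up ad description

# sparse.txt line format: "#<ad_id>,<word_index>:<word_freq>,..."
counts = {}
for token in toy_tokens:
    idx = toy_vocab[token]
    counts[idx] = counts.get(idx, 0) + 1
line = "#00000000," + ",".join(f"{i}:{c}" for i, c in counts.items())
print(line)  # -> #00000000,2:2,0:1

In other words, vocab.txt tells you which word each integer index stands for, and each sparse.txt line compresses one description into index:count pairs.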
Solution ⛳️
So we always begin by importing the required libraries. Owing to the nature of this exercise, the following are required:
Import libraries
- Regular Expressions
The first is Regular Expressions, called RegEx for short. If you haven't used them, I strongly suggest you pick them up and get some cool stuff done. Further down, I've provided some details to start with.
# Regular Expressions (RegEx)
import re
- Natural Language Toolkit
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
nltk.probability provides classes for representing and processing probabilistic information, such as FreqDist, which we'll use later.
# Natural Language Toolkit
import nltk
from nltk.probability import *
from nltk.corpus import stopwords
- Itertools
The Python itertools module is a collection of tools for handling iterators. Simply put, iterators are data types that can be used in a for loop. The most common iterable in Python is the list.
# Functions creating iterators for efficient looping
import itertools
from itertools import chain
from itertools import groupby
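If these helpers are new to you, here is a quick toy demonstration of the two we will lean on most (the inputs are made up and unrelated to the job data):

# chain flattens nested iterables into one stream
print(list(chain.from_iterable([[1, 2], [3], [4, 5]])))
# [1, 2, 3, 4, 5]

# groupby groups consecutive equal items
print([(key, len(list(group))) for key, group in groupby("aaabbc")])
# [('a', 3), ('b', 2), ('c', 1)]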
Let’s write some code 🔥
We import the data first. The file named data.txt is on GitHub for your reference. I have it saved on my local computer in the same folder as my Jupyter Notebook file.
Before reading the file, we define an empty list and call it data, for convenience (after reading, it will actually hold the file contents as a single string).
data = []
Then we simply read data.txt and save it in data. Make sure you define the encoding format utf8; otherwise you might get an error.
- Sample error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 260893: character maps to <undefined>
Another consideration is that we directly convert the text to lowercase for consistency, using the .lower() method.
with open('data.txt', encoding="utf8") as f:
data = f.read().lower()
Formatting and cleansing ✂️ 🔨 📌
Now, we need to begin the process of tokenizing the text. The task of breaking a character sequence into pieces is known as tokenization.
First, we have to remove all the noise, such as /-*#@ or any other non-word characters or extra spaces, from the text, and we do that with the powerful RegEx tool.
In order to run the formatting using RegEx, there are two steps you need to take: (1) create the pattern, and (2) run the pattern using Python code to find the matches.
# (1) create a pattern for RegEx to find and keep matching words only
pattern = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

# (2) tokenise the words: match the pattern against the file's content
tokenised = pattern.findall(data)
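To see what this pattern actually keeps and drops, here is a quick run on a made-up sample sentence (not from data.txt): it keeps alphabetic words, allows one internal hyphen or apostrophe, and discards digits and punctuation.

# a made-up sample sentence, just to see what the pattern keeps
sample = "full-time role! salary: $55k / apply at jobs@example.com"
print(pattern.findall(sample))
# ['full-time', 'role', 'salary', 'k', 'apply', 'at', 'jobs', 'example', 'com']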
There are heaps of online resources on regex, but the one I found most interesting is https://regex101.com/. Not only does it help you match your text against a pattern, it also provides short and sweet informative content. In Picture 1, I've provided a brief list of the functionalities they provide on their page.
Useful Regex Resources for Python:
- Python Regex Tutorial for Data Science
- Python 3 re module documentation
- Online regex tester and debugger
Indexing the tokenized list 📇
Now, I've indexed the tokens based on the positions of id and title in each job ad:
# pass the length of the 'tokenised' series into a variable
tokenised_len = len(tokenised)

# index the tokens based on the positions of "id" and "title"
indexes = [i for i, v in enumerate(tokenised) if v == 'id' and i+1 < tokenised_len and tokenised[i+1] == 'title']
Next, we create a function from the itertools recipes, which iterates through the list of tokens and creates sub-lists so that each includes the tokens of one job ad only. From these, we will later build a data dictionary.
# from itertools recipes
def pairwise(iterable, fillvalue=None):
"""
This function iterates through the list of tokens and
    creates sub-lists that include tokens related to one job ad only
"""
a, b = iter(iterable), iter(iterable)
next(b, None)
return itertools.zip_longest(a, b, fillvalue=fillvalue)
# pairwise based on the indexes from the last block; store the sub-lists back in 'tokenised'
tokenised = [tokenised[i:j] for i, j in pairwise(indexes)]
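For intuition, here is what pairwise() produces on a made-up index list; the trailing None in the final pair makes the last slice, tokenised[112:None], run to the end of the token list:

# toy index list, just for intuition (the numbers are made up)
print(list(pairwise([0, 57, 112])))
# [(0, 57), (57, 112), (112, None)]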
To create the data dict, I used Python itertools. Jason Rigdel has written a good explanation with a set of examples on the topic of itertools in Python.
However, the list contains a lot of functional words, such as "to", "in", "the", "is", and so on. These functional words usually do not contribute much to the semantics of the text, beyond increasing the dimensionality of the data in text analysis. Also, note that our goal, usually, is to build a predictive classification model; thus, we are more interested in the meaning of the text than in its syntax. Therefore, we can choose to remove those words, which is our next task.
I will exclude all tokens shorter than 4 characters by replacing each of them with a placeholder token named to_remove. This placeholder will be added to a generic English stopwords list later, so the short tokens get dropped along with the stop words.
tokenised = [[word if len(word) > 3 else "to_remove" for word in job] for job in tokenised]
Removing StopWords ✂️
Stop words carry little lexical content. They are often functional words in English, for example articles, pronouns, particles, and so on. In NLP and IR, we usually exclude stop words from the vocabulary; otherwise, we will face the curse of dimensionality. There are some exceptions: in syntactic analysis tasks such as parsing, we choose to keep those functional words. Here, however, we are going to remove all the stop words. NLTK's own stop word list can be loaded like this:
nltk.download('stopwords')
stopwords_list = stopwords.words('english')
For this example, I have already provided the stopwords_en.txt file on my GitHub, where you can download it. We first append the to_remove placeholder to the stopwords_en.txt file, read the file back, and then save the words as a set().
# adding the 'to_remove' string to the list of stopwords
stopwords = []
with open('stopwords_en.txt', "a") as f:
    f.write("\nto_remove")  # \n to shift to the next line
with open('stopwords_en.txt') as f:
    stopwords = f.read().splitlines()  # read the stopwords line by line into a list

# convert the stopwords list into a set
stopwordsset = set(stopwords)
You might be wondering why we saved stopwords as a set. That's a good question: a Python set is a better choice than a list here because membership lookups on a set run much faster than on a list when searching a large number of hashable items.
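A minimal sketch of the difference, if you want to convince yourself (exact timings will vary by machine):

import timeit

haystack_list = list(range(100_000))
haystack_set = set(haystack_list)

# membership test: a list scans element by element, a set does one hash lookup
print(timeit.timeit(lambda: 99_999 in haystack_list, number=1_000))  # slow: O(n) per lookup
print(timeit.timeit(lambda: 99_999 in haystack_set, number=1_000))   # fast: O(1) per lookup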
Next, I've created a function called purifier(), which essentially purifies the tokens by removing stopwords, and then run the tokenised list through it.
def purifier(tokenList,remove_token):
"""
    This function takes two inputs (the list of current tokens
    and the list of tokens to be removed).
    It converts each job's tokens into a set to improve
    performance, and returns a list of sets, each of which
    contains the purified tokens with the remove_token items
    taken out
"""
    return [set(word for word in job if word not in remove_token) for job in tokenList]

# running the 'purifier' function
tokenised = purifier(tokenised, stopwordsset)
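A toy call (with made-up tokens) shows the behaviour, including a side effect worth noting: because purifier() builds a set per job ad, duplicate tokens within a single ad are collapsed as well.

toy_jobs = [['data', 'the', 'data', 'science'], ['senior', 'and', 'analyst']]
print(purifier(toy_jobs, {'the', 'and'}))
# [{'data', 'science'}, {'senior', 'analyst'}]  (set element order may differ)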
Next, we remove the words that appear only once in one job advertisement description and save them (no duplication) as a txt file (refer to the required output). We also need to exclude those words from the generated vocabulary.
To do that, we begin by using the chain() function to join all the words from all the job ads together into one list. "A Guide to Python Itertools" has a good explanation of how the chain() function works.
stop_wrds_removed_words = list(chain.from_iterable([word for word in job] for job in tokenised))
We then convert the list of words into a set to remove duplicates and create the vocabulary set:
stop_wrds_removed_vocab = set(stop_wrds_removed_words)
Next, we pass the words into the FreqDist() function to count the tokens.
The FreqDist class is used to encode "frequency distributions", which count the number of times that each outcome of an experiment occurs. It is one of the classes in the nltk.probability module. According to Devopedia, each text corpus is typically a collection of text sources, and there are dozens of such corpora for a variety of NLP tasks (here we ignore speech corpora and consider only those in text form). In our example, the text corpus refers to the combination of all job ads together (and not each job separately). The following line counts the number of times a word occurs in the whole corpus, regardless of which ad it is in.
fd = FreqDist(stop_wrds_removed_words)
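On a toy list of made-up words, FreqDist behaves much like a counter:

toy_fd = FreqDist(['data', 'science', 'data', 'jobs'])
print(toy_fd['data'])         # 2
print(toy_fd.most_common(2))  # [('data', 2), ('science', 1)]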
Low-Frequency Tokens
To find the low-frequency tokens, I create a list of the tokens which have occurred only once, and convert the list into a set for a performance improvement.
once_only = set([k for k, v in fd.items() if v == 1])

# sort the set into alphabetical order (sorted() returns a list)
once_only = sorted(once_only)
To create the lowFreq.txt file, I've saved the sorted words that appear "once only" in one job advertisement description into a file of the same name.
out_file = open("lowFreq.txt", 'w')
for d in once_only:
out_file.write(''.join(d) + '\n')
out_file.close()
High-Frequency Tokens
At this stage, I repeat the same steps as above; this time, however, the intention is to find the high-frequency words and save them in a file named highFreq.txt.
I start by removing the lowFreq tokens from the list of tokens by running the purifier() function that we defined earlier.
tokenised = purifier(tokenised,once_only)
Next, we create a new list of words after the once_only words have been removed.
LowFreqRemoved_Words = list(chain.from_iterable([word for word in job] for job in tokenised))
LowFreqRemoved_vocab = set(LowFreqRemoved_Words)
LowFreqRemoved_fd = FreqDist(LowFreqRemoved_Words)
For the high-frequency words, I have selected a threshold of 100. You can definitely choose a different threshold according to the context of your work.
highFreq = set([k for k, v in LowFreqRemoved_fd.items() if v > 100])
Now, we save the high-frequency words that appear in more than 100 job advertisement descriptions to a file, ordered by frequency from high to low, as the required output specifies.
out_file = open("highFreq.txt", 'w')
for d in highFreq:
out_file.write(''.join(d) + '\n')
out_file.close()
Once again, we run the purifier() function to remove the highFreq set and create a new list.
tokenised = purifier(tokenised,highFreq)
HighFreqRemoved_words = list(chain.from_iterable([word for word in job] for job in tokenised))
HighFreqRemoved_vocab = set(HighFreqRemoved_words)
Note
You might be wondering what the difference is between the words and vocab lists in my code, and why every time I created a words list, a vocab was also created. The reason goes back to the difference between a list and a set in Python. The bottom line is that in vocab each word is listed only once, while words might contain duplicates.
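In miniature (made-up words again):

toy_words = ['data', 'science', 'data']   # a words list keeps duplicates
toy_vocab = set(toy_words)                # a vocab set lists each word once
print(len(toy_words), len(toy_vocab))     # 3 2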
Next is just a simple checkpoint to see the progress of the token purification:
print(f"Length of words: {len(stop_wrds_removed_words)}")print(f"Length of vocab: {len(stop_wrds_removed_vocab)}")print(f"Length of LowFreqRemoved_Words: {len(LowFreqRemoved_Words)}")print(f"Length of LowFreqRemoved_vocab: {len(LowFreqRemoved_vocab)}")print(f"Length of HighFreqRemoved_words: {len(HighFreqRemoved_words)}")print(f"Length of HighFreqRemoved_vocab: {len(HighFreqRemoved_vocab)}")
…which gives us the following output:
Length of words: 474345
Length of vocab: 18619
Length of LowFreqRemoved_Words: 465779
Length of LowFreqRemoved_vocab: 10053
Length of HighFreqRemoved_words: 126491
Length of HighFreqRemoved_vocab: 9103
Next, we create a file of all the vocab terms, named vocab.txt.
# sort the final vocabulary alphabetically so the integer indexes follow alphabetical order
HighFreqRemoved_vocab = sorted(HighFreqRemoved_vocab)

# dictionary of final vocab terms mapped to their integer indexes
vocab = {HighFreqRemoved_vocab[i]: i for i in range(0, len(HighFreqRemoved_vocab))}
We build a function to create the vocab.txt file, and finally build the sorted file by calling the function:
def vocab_output(file):
    with open(file, "w") as f:  # 'w' so re-running the notebook does not append duplicates
        for key in sorted(vocab.keys()):
            f.write("%s:%s\n" % (key, vocab[key]))

# calling the function to build the file
vocab_output("vocab.txt")
vocab.txt output:
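The screenshot from the original post is not reproduced here, but with the alphabetical indexing above, the first lines of vocab.txt would look something like this (the words below are illustrative, not the actual output):

aaron:0
abandon:1
ability:2
...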
Until this point, I tried to keep the code as simple as possible for practice purposes. For this step, however, I created a slightly more complex, yet more efficient, piece of code. Please leave any questions in the comments and I will make sure they are answered.
The final activity is to build the sparse representation of each description in the form word_index:word_freq, separated by commas, and create the file sparse.txt.
data = {}
id = None
with open('data.txt', 'r', encoding="utf8") as f:
    for i, line in enumerate(f):  # iterate over the imported file line by line
        line = line.lower()
        line = line.strip()
        if not line:
            continue
        section = line.split(':')[0]  # 'section' tells us how the line begins (id / title / description)
        content = ':'.join(line.split(':')[1:]).strip()  # 'content' captures everything after the section name
        if section == 'id':  # id section:
            if id:  # error handling in case of bad formatting: multiple ids
                raise ValueError('unable to parse file at line %d, multiple ids' % i)
            id = content[1:]  # capture the job id (dropping the leading character)
            if id in data.keys():  # error handling in case of bad formatting: duplicate ids
                raise ValueError('unable to parse file at line %d, duplicate id' % i)
        elif section == 'description':  # capture the job description for each job ad
            if not id:  # error handling in case of bad formatting: missing id
                raise ValueError('unable to parse file at line %d, missing id' % i)
            content = pattern.findall(line)
            content = [value for value in content if len(value) > 3]  # remove short-character tokens
            content = [value for value in content if value not in stopwordsset]  # remove stopwords
            content = [value for value in content if value not in once_only]  # remove lowFreq tokens
            content = [value for value in content if value not in highFreq]  # remove highFreq tokens
            data[id] = content  # add to the data dictionary
            id = None
        elif section == 'title':  # if the line starts with 'title', do nothing
            continue
        else:
            raise ValueError('unable to parse file at line %d, unexpected section name' % i)
And finally, we build the sparse.txt file.
with open('sparse.txt', "w") as f:
    for jobID, content in data.items():  # go through the data dictionary created in the last block
        fd_parse = FreqDist(content)  # count the number of times each token occurred within the same job ad
        tmp = ""  # create a placeholder string for the word_index:word_freq pairs
        for (x, y) in fd_parse.items():  # iterate through the frequencies
            tmp += f"{vocab[x]}:{y},"  # append each word_index:word_freq pair to the placeholder
        f.write(f"#{jobID},{tmp[:-1]}\n")  # write to the file line by line (dropping the trailing comma)
Output of sparse.txt:
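Again, the original screenshot is omitted; each line of sparse.txt would look something like the following, with the ad ID first and then the comma-separated word_index:word_freq pairs (the numbers are illustrative, not actual output):

#12612628,11:1,874:2,2047:1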
— end —