Text Processing Tools I Wish I Knew Earlier

Jao Ming · Published in Analytics Vidhya · 5 min read · Sep 6, 2021

Introduction

Natural Language Processing (NLP) has been my field of interest throughout my journey practising data science, from school work to internships to my full-time career. It has always been intriguing and fascinating to read about the latest and greatest NLP technologies like transformers. However, one thing that remains a cornerstone of NLP is text processing.

I have actually written a Medium article that outlines some of the usual processes, such as lemmatizing and stemming, and the removal of stopwords and punctuation. If you want to find out more, here is a link to the article I am talking about, although countless other articles have already gone through these steps before. Which brings me to the reason I am writing this article: I would like to share 2 text processing functions that I have found useful throughout my data science journey and that I have not found much literature on.

They are Hashtag Processing and Gibberish Detection.

Hashtag Processing

Back when I was working at a MarTech startup that focused on influencers, I had to deal with a lot of data from social media. This meant that the text data was much messier and less structured compared to news articles or research papers. And given that most social media text is personalised, it also differs slightly from text like Amazon reviews. With social media data, there are a lot more emojis, mentions, slang and hashtags.

Hashtags are definitely one of the more problematic components of the text. It is easy to remove emojis and mentions, or to replace slang. However, even if the hashtag symbol is removed, it still leaves a combination of characters with no whitespace in between, and most text processing tokenizes the words in a sentence by splitting on whitespace:

"This is a sentence" -> ["This", "is", "a" "sentence"]

Simply removing the hashtag symbol would result in the character combination being considered a token itself:

"foodislife" -> ["foodislife"]

# instead of

"foodislife" -> ["food", "is", "life"]

Hence, we need a hashtag processing function that is able to split this combination up into its individual words.

Here is the code in full:
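Below is a minimal sketch of the idea in Python. The generate_word_dictionary name appears later in this article; the split_hashtag name, the 20-character word-length cap and the exact scoring details are assumptions on my part rather than the original implementation.

from collections import Counter
from math import log
from typing import Dict, List

def generate_word_dictionary(corpus: List[str]) -> Dict[str, int]:
    """Count the occurrence of each word in the corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence.lower().split())
    return dict(counts)

def split_hashtag(text: str, word_dict: Dict[str, int]) -> List[str]:
    """Split a hashtag body like 'foodislife' into ['food', 'is', 'life'].

    Viterbi-style pass: best[i] is the best log-probability of any
    segmentation of text[:i], and splits[i] remembers where the last
    word of that segmentation starts.
    """
    total = sum(word_dict.values())
    text = text.lower()
    n = len(text)
    best = [0.0] + [float("-inf")] * n
    splits = [0] * (n + 1)
    for end in range(1, n + 1):
        # Only consider windows up to 20 characters long (assumed cap)
        for start in range(max(0, end - 20), end):
            word = text[start:end]
            if word in word_dict and best[start] > float("-inf"):
                score = best[start] + log(word_dict[word] / total)
                if score > best[end]:
                    best[end] = score
                    splits[end] = start
    if best[n] == float("-inf"):
        return [text]  # no segmentation found, return the token as-is
    # Walk the recorded split points backwards to recover the words
    words, end = [], n
    while end > 0:
        start = splits[end]
        words.append(text[start:end])
        end = start
    return list(reversed(words))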

How It Works

Based on a corpus, we generate a dictionary of word counts, which means we literally count the occurrence of each word in the corpus. From this, we iterate character by character through the combination of letters, and at each iteration we use the dictionary to calculate the probability that the current window of characters is a word.

The probability is calculated from how likely that window of letters is to be a word, combined with the probability that the window's starting index is a good starting point for a word, i.e. how good the best segmentation of everything before that index is.

Through this method, at each of the possible splits in the combination of letters we have the probability of that split forming a word. The algorithm then keeps track of the indexes where the probabilities are highest and splits the combination of letters at those points to form a sentence.

'ilovefood'

# iteration 1:
'i'
## at 'i', since it's such a common word, it would have a high probability
## hence, we consider it a word
# iteration 2:
'l'
## 'l' is not common by itself and hence would have a low probability of it being a word
# iteration 3:
'lo'
## 'lo' is also not common.
# and so on.. until we split at
'i', 'love', 'food'
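As a rough usage sketch (using the hypothetical names from the sketch above and a toy corpus):

corpus = ["i love good food", "food is life", "i love my life"]
word_dict = generate_word_dictionary(corpus)
print(split_hashtag("ilovefood", word_dict))  # ['i', 'love', 'food']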

Note that the word_dict.json file came from the generate_word_dictionary function above; I simply exported the dictionary into a .json file. For your convenience, here is a function that does that export:

import json

def export_dict_to_json(dictionary: dict, file_name: str):
    # Write the word-count dictionary out as a .json file
    with open(file_name, 'w') as file:
        json.dump(dictionary, file)
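Putting the pieces together, generating, exporting and reloading the dictionary might look like this (again with the hypothetical names from the sketch above):

word_dict = generate_word_dictionary(["i love good food", "food is life"])
export_dict_to_json(word_dict, 'word_dict.json')

# Later, load the counts back instead of rebuilding them from the corpus
with open('word_dict.json') as file:
    word_dict = json.load(file)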

Gibberish Detection

This was not as broadly useful as the hashtag function, but I genuinely found it interesting and extremely useful when it was required. I found this function especially useful to run before performing data exploration, because some sentences in a dataset might contain extremely long stretches of gibberish. These skew the metrics of the sentence and the dataset, and make it difficult to truly understand the data.

Before I explain how the algorithm works, let me show the full code:
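Below is a minimal sketch of the approach in Python. The generate_gibberish_model name comes from this article, while gibberish_score, is_gibberish, the add-one smoothing and the use of an averaged log probability are assumptions on my part rather than the exact original code.

import math
from typing import Dict, List

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def _normalise(text: str) -> str:
    # Keep only lowercase letters and spaces so the transition table stays small
    return "".join(ch for ch in text.lower() if ch in ALPHABET)

def generate_gibberish_model(corpus: List[str]) -> Dict[str, Dict[str, float]]:
    """Count letter-to-letter transitions and turn them into log probabilities."""
    # Start every count at 1 (add-one smoothing) so unseen pairs are unlikely, not impossible
    counts = {a: {b: 1.0 for b in ALPHABET} for a in ALPHABET}
    for sentence in corpus:
        clean = _normalise(sentence)
        for first, second in zip(clean, clean[1:]):
            counts[first][second] += 1
    # Normalise each row into log probabilities
    for first, row in counts.items():
        total = sum(row.values())
        for second in row:
            row[second] = math.log(row[second] / total)
    return counts

def gibberish_score(text: str, model: Dict[str, Dict[str, float]]) -> float:
    """Average transition probability of the string under the model."""
    clean = _normalise(text)
    log_prob, transitions = 0.0, 0
    for first, second in zip(clean, clean[1:]):
        log_prob += model[first][second]
        transitions += 1
    if transitions == 0:
        return 0.0
    return math.exp(log_prob / transitions)

def is_gibberish(text: str, model: Dict[str, Dict[str, float]], threshold: float = 0.0188) -> bool:
    # Scores below the threshold are treated as gibberish
    return gibberish_score(text, model) < threshold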

The gibberish detection function works similarly to the hashtag processing function, in the sense that both of them make use of probabilities. However, instead of getting the probability of words, the gibberish detection function gets the probability of one letter following another. For example, the probability of ‘a’ coming after ‘b’, or the probability of the letter ‘z’ coming after ‘k’.

How It Works

From a corpus, generate_gibberish_model reads each sentence and counts the occurrences of each pair of adjacent letters.

'this is not gibberish'

't' -> 'h'
'h' -> 'i'
'i' -> 's'
's' -> ' '
' ' -> 'i'
...
'b' -> 'e'
'e' -> 'r'
'r' -> 'i'
'i' -> 's'
's' -> 'h'

This generates counts of how often each pair of letters occurs together. The dictionary output of this function is then used in the scoring function to calculate how likely a string of characters is to be gibberish. A higher probability shows that the string of characters is plausible, whereas a low probability shows that the combination of characters makes no sense and is gibberish.

The way this is done is by iterating through a string of characters, retrieving the probability of each of the 2 letters occurring and then summing up all the probabilities together to get a single probability value. As seen in the code, log probabilities are used instead of straight probabilities so as to avoid numeric underflow issues that could arise from long texts.
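As a quick illustration of why log probabilities matter (a toy check of my own, not from the original code), multiplying many small probabilities underflows to zero, while summing their logs stays finite:

import math

probs = [0.01] * 200                   # 200 transitions, each with probability 0.01

product = 1.0
for p in probs:
    product *= p                       # underflows to 0.0 long before the end

log_sum = sum(math.log(p) for p in probs)

print(product)                         # 0.0
print(log_sum)                         # -921.03...
print(math.exp(log_sum / len(probs)))  # 0.01, the average transition probability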

This method will return a probability value regardless of whether the token is gibberish or not. Hence, there is a need to declare a threshold value to determine at what probability we can be confident that the token is gibberish. In the code above I used 0.0188. Any value can be used, but let me explain how I got this one.

I had a few examples of good text and gibberish text. I then ran the gibberish detection function on all of the good and gibberish texts and found a value that sits between the lowest score among the good texts and the highest score among the gibberish texts. For me, that value is 0.0188.
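A rough sketch of that calibration step, with placeholder example lists (in practice the model would be trained on a much larger corpus and the examples hand-picked from the dataset):

good_examples = ["food is life", "i love reading about transformers"]
gibberish_examples = ["asdkfjqwe zxnmcvb", "qqqwwweee rrrttt"]

model = generate_gibberish_model(["this is a normal sentence", "food is life", "i love reading"])

good_scores = [gibberish_score(text, model) for text in good_examples]
bad_scores = [gibberish_score(text, model) for text in gibberish_examples]

# Pick a value between the lowest good score and the highest gibberish score
threshold = (min(good_scores) + max(bad_scores)) / 2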

Conclusion

The logic for these functions is pretty simple but yields great results when it comes to text processing. I hope that these functions help someone as much as they have helped me, and are still helping me, in becoming a better data scientist. And if any of you have come across any interesting text processing methods/functions/pages, do leave them in the comments below! Thanks for reading!
