From Raw Text to Insightful Analysis: NLP Text Preprocessing Explained

Pallavi Padav
Women in Technology
12 min read · May 11, 2024

The world is stepping into a realm where machines can understand human language and respond the way we do. NLP, transformers, LLMs, and the like are the buzzwords of the day. Ready to dive deeper into the world of language modeling? Start with cleaning. Transforming raw data into a machine-readable format is a prerequisite for any language modeling. Let's get started with text preprocessing.

Text Preprocessing

text = '''Top 20 Travel Bloggers in India From You Can Get Travel Idea.
Anuradha Goyal:

When it comes to exploring new destinations, there's no stopping this gypsy woman, who has made traveling the passion of her life.
The innate wanderlust in Anuradha Goyal has taken her to almost every state in India and to 15 countries beyond the borders of
this land. The travel blog that she has been writing since 2004 is a picturesque virtual diary, full of inspiring tales, and
rich and useful information about the history and culture of the places she travels to. She is also a book reviewer and the
author of The Mouse Charmers - Digital Pioneers of India.

Blog: https://www.inditales.com/

Venkat Ganesh
Road trips are a thrilling way of discovering new destinations, and Venkat Ganesh is a pro at that.
A well-known travel blogger who loves to be on the road, with his bike as the sole travel companion, Venkat quit his job to satiate his travel hunger.
His blog depicts exciting sagas of his road journeys; at the same time, dishing out information for backpackers and intrepid travelers, who are always
raring to go that extra mile in pursuit of their dream destinations. You will be amazed to read about his unplanned journeys that have worked out to be
such exciting adventures in themselves.

Blog: http://www.indiabackpackmotorbike.com/

😀, 🌍, 🎉, 👋

<p>
WWF's mission is to stop the
<strong>
degradation
</strong>
of our planet's natural environment.
</p>

thx b4n
:) , :(

Café, Crêpes, Über, Pokémon, Résumé, naïve, façade, mañana, fiancé, déjà vu
'''

The text above will be used as input for all the preprocessing steps explained below.

1. Removal of non-printable (Non-ASCII) characters

Before eliminating non-ASCII characters, let's take a moment to grasp the distinction between ASCII and non-ASCII characters.

The ASCII code enables a computer to recognize and display letters, digits, punctuation marks, and a handful of control characters. Non-ASCII characters form a much larger set of special characters that includes accented letters, glyphs, ideographs, Cyrillic letters, mathematical symbols, currency symbols, and more. Standard ASCII covers code points 0–127 (of which 32–126 are the printable characters); anything from 128 upward falls outside ASCII and is what we strip out here.
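
As a quick illustration (my own aside, not part of the preprocessing pipeline itself), ord() exposes a character's code point, so an ASCII check is simply a comparison against 128:

# Characters with code points below 128 are ASCII; everything else is not
for ch in "Café":
    print(ch, ord(ch), ord(ch) < 128)
# 'C', 'a' and 'f' are ASCII; 'é' has code point 233, so it is non-ASCII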

ASCII and non-ASCII code tables: https://www.seozoom.com/non-ascii-characters/
import re
import string

# Remove non-printable (non-ASCII) characters
def remove_non_ASCII(text_ip):
    # Keep only characters that appear in string.printable
    text_ip = ''.join([char for char in text_ip if char in string.printable])
    # Replace tabs, newlines and carriage returns with spaces
    pat = re.compile(r'[\t\n\r]')
    return pat.sub(r' ', text_ip)

text = remove_non_ASCII(text)
text

In the code above, along with non-ASCII characters, I'm also replacing \t, \n and \r with spaces.

o/p: the text collapsed onto a single line, with the emojis and accented characters stripped out.

2. Removal of all URLs

URLs present in the input text may not be of much use, so we can remove them or replace them with a placeholder string, say 'URL'.

# Remove all URLs, replace with the string URL
def remove_URL(text_ip):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'URL', text_ip)

text = remove_URL(text)
text
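
For instance, on one of the blog lines from the sample text the substitution looks like this:

print(remove_URL("Blog: https://www.inditales.com/"))   # 'Blog: URL'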


3. Removal of all HTML tags

HTML tags do not add any value to the input that we need to process, so we remove them.

# Remove HTML tags
def remove_HTML(text_ip):
    html = re.compile(r'<.*?>')
    return html.sub(r' ', text_ip)

text = remove_HTML(text)
text
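
A regex like the one above is fine for simple snippets; for messier markup, a dedicated parser is more robust. Here is an alternative sketch (my addition, using BeautifulSoup, which the original pipeline does not rely on):

from bs4 import BeautifulSoup

def remove_HTML_bs4(text_ip):
    # get_text() drops all tags and returns only the visible text
    return BeautifulSoup(text_ip, "html.parser").get_text(separator=" ")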


4. Converting emoticons to strings

Emoticons such as the sad face and the happy face carry emotional information about the text, so converting them to words adds useful information to the input data.

# Replace sad smileys with SAD FACE
def transcription_sad(text_ip):
    smiley = re.compile(r'[8:=;][\'\-]?[(\\/]')
    return smiley.sub(r'SAD FACE', text_ip)

text = transcription_sad(text)
text

# Replace some smileys with SMILE
def transcription_smile(text_ip):
    smiley = re.compile(r'[8:=;][\'\-]?[)dDp]')
    return smiley.sub(r'SMILE', text_ip)

text = transcription_smile(text)
text
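
Applied to the ':) , :(' line in the sample text, the two substitutions give:

print(transcription_smile(transcription_sad(":) , :(")))   # 'SMILE , SAD FACE'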

5. Dealing with emojis

Note that the removal of non-ASCII characters in section 1 already removes all the emojis: the output of section 1 does not contain any. The sequence of steps therefore plays an important role in retaining information.

Let me go back to the initial input text to handle these emojis.

import emoji
text = emoji.demojize(text)
text

demojize converts each emoji to its textual description. Do not forget to import the emoji library.
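
For example, assuming a recent version of the emoji library (the exact alias text can vary between versions):

print(emoji.demojize("I am 😀"))   # something like 'I am :grinning_face:'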

Another approach is to remove all the emojis or replace them with a placeholder.

# Remove all emojis, replace with EMOJI
def remove_emoji(text_ip):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'EMOJI', text_ip)

6. Replacing abbreviations with their full forms

abbreviations = {
"$" : " dollar ",
"€" : " euro ",
"4ao" : "for adults only",
"a.m" : "before midday",
"a3" : "anytime anywhere anyplace",
"aamof" : "as a matter of fact",
"acct" : "account",
"adih" : "another day in hell",
"afaic" : "as far as i am concerned",
"afaict" : "as far as i can tell",
"afaik" : "as far as i know",
"afair" : "as far as i remember",
"afk" : "away from keyboard",
"app" : "application",
"approx" : "approximately",
"apps" : "applications",
"asap" : "as soon as possible",
"asl" : "age, sex, location",
"atk" : "at the keyboard",
"ave." : "avenue",
"aymm" : "are you my mother",
"ayor" : "at your own risk",
"b&b" : "bed and breakfast",
"b+b" : "bed and breakfast",
"b.c" : "before christ",
"b2b" : "business to business",
"b2c" : "business to customer",
"b4" : "before",
"b4n" : "bye for now",
"b@u" : "back at you",
"bae" : "before anyone else",
"bak" : "back at keyboard",
"bbbg" : "bye bye be good",
"bbc" : "british broadcasting corporation",
"bbias" : "be back in a second",
"bbl" : "be back later",
"bbs" : "be back soon",
"be4" : "before",
"bfn" : "bye for now",
"blvd" : "boulevard",
"bout" : "about",
"brb" : "be right back",
"bros" : "brothers",
"brt" : "be right there",
"bsaaw" : "big smile and a wink",
"btw" : "by the way",
"bwl" : "bursting with laughter",
"c/o" : "care of",
"cet" : "central european time",
"cf" : "compare",
"cia" : "central intelligence agency",
"csl" : "can not stop laughing",
"cu" : "see you",
"cul8r" : "see you later",
"cv" : "curriculum vitae",
"cwot" : "complete waste of time",
"cya" : "see you",
"cyt" : "see you tomorrow",
"dae" : "does anyone else",
"dbmib" : "do not bother me i am busy",
"diy" : "do it yourself",
"dm" : "direct message",
"dwh" : "during work hours",
"e123" : "easy as one two three",
"eet" : "eastern european time",
"eg" : "example",
"embm" : "early morning business meeting",
"encl" : "enclosed",
"encl." : "enclosed",
"etc" : "and so on",
"faq" : "frequently asked questions",
"fawc" : "for anyone who cares",
"fb" : "facebook",
"fc" : "fingers crossed",
"fig" : "figure",
"fimh" : "forever in my heart",
"ft." : "feet",
"ft" : "featuring",
"ftl" : "for the loss",
"ftw" : "for the win",
"fwiw" : "for what it is worth",
"fyi" : "for your information",
"g9" : "genius",
"gahoy" : "get a hold of yourself",
"gal" : "get a life",
"gcse" : "general certificate of secondary education",
"gfn" : "gone for now",
"gg" : "good game",
"gl" : "good luck",
"glhf" : "good luck have fun",
"gmt" : "greenwich mean time",
"gmta" : "great minds think alike",
"gn" : "good night",
"g.o.a.t" : "greatest of all time",
"goat" : "greatest of all time",
"goi" : "get over it",
"gps" : "global positioning system",
"gr8" : "great",
"gratz" : "congratulations",
"gyal" : "girl",
"h&c" : "hot and cold",
"hp" : "horsepower",
"hr" : "hour",
"hrh" : "his royal highness",
"ht" : "height",
"ibrb" : "i will be right back",
"ic" : "i see",
"icq" : "i seek you",
"icymi" : "in case you missed it",
"idc" : "i do not care",
"idgadf" : "i do not give a damn fuck",
"idgaf" : "i do not give a fuck",
"idk" : "i do not know",
"ie" : "that is",
"i.e" : "that is",
"ifyp" : "i feel your pain",
"IG" : "instagram",
"iirc" : "if i remember correctly",
"ilu" : "i love you",
"ily" : "i love you",
"imho" : "in my humble opinion",
"imo" : "in my opinion",
"imu" : "i miss you",
"iow" : "in other words",
"irl" : "in real life",
"j4f" : "just for fun",
"jic" : "just in case",
"jk" : "just kidding",
"jsyk" : "just so you know",
"l8r" : "later",
"lb" : "pound",
"lbs" : "pounds",
"ldr" : "long distance relationship",
"lmao" : "laugh my ass off",
"lmfao" : "laugh my fucking ass off",
"lol" : "laughing out loud",
"ltd" : "limited",
"ltns" : "long time no see",
"m8" : "mate",
"mf" : "motherfucker",
"mfs" : "motherfuckers",
"mfw" : "my face when",
"mofo" : "motherfucker",
"mph" : "miles per hour",
"mr" : "mister",
"mrw" : "my reaction when",
"ms" : "miss",
"mte" : "my thoughts exactly",
"nagi" : "not a good idea",
"nbc" : "national broadcasting company",
"nbd" : "not big deal",
"nfs" : "not for sale",
"ngl" : "not going to lie",
"nhs" : "national health service",
"nrn" : "no reply necessary",
"nsfl" : "not safe for life",
"nsfw" : "not safe for work",
"nth" : "nice to have",
"nvr" : "never",
"nyc" : "new york city",
"oc" : "original content",
"og" : "original",
"ohp" : "overhead projector",
"oic" : "oh i see",
"omdb" : "over my dead body",
"omg" : "oh my god",
"omw" : "on my way",
"p.a" : "per annum",
"p.m" : "after midday",
"pm" : "prime minister",
"poc" : "people of color",
"pov" : "point of view",
"pp" : "pages",
"ppl" : "people",
"prw" : "parents are watching",
"ps" : "postscript",
"pt" : "point",
"ptb" : "please text back",
"pto" : "please turn over",
"qpsa" : "what happens",
"ratchet" : "rude",
"rbtl" : "read between the lines",
"rlrt" : "real life retweet",
"rofl" : "rolling on the floor laughing",
"roflol" : "rolling on the floor laughing out loud",
"rotflmao" : "rolling on the floor laughing my ass off",
"rt" : "retweet",
"ruok" : "are you ok",
"sfw" : "safe for work",
"sk8" : "skate",
"smh" : "shake my head",
"sq" : "square",
"srsly" : "seriously",
"ssdd" : "same stuff different day",
"tbh" : "to be honest",
"tbs" : "tablespooful",
"tbsp" : "tablespooful",
"tfw" : "that feeling when",
"thks" : "thank you",
"tho" : "though",
"thx" : "thank you",
"tia" : "thanks in advance",
"til" : "today i learned",
"tl;dr" : "too long i did not read",
"tldr" : "too long i did not read",
"tmb" : "tweet me back",
"tntl" : "trying not to laugh",
"ttyl" : "talk to you later",
"u" : "you",
"u2" : "you too",
"u4e" : "yours for ever",
"utc" : "coordinated universal time",
"w/" : "with",
"w/o" : "without",
"w8" : "wait",
"wassup" : "what is up",
"wb" : "welcome back",
"wtg" : "way to go",
"wtpa" : "where the party at",
"wuf" : "where are you from",
"wuzup" : "what is up",
"wywh" : "wish you were here",
"yd" : "yard",
"ygtr" : "you got that right",
"ynk" : "you never know",
"zzz" : "sleeping bored and tired"
}
def word_abbrev(word):
    # Look up the lowercased word; keep it unchanged if it is not a known abbreviation
    return abbreviations[word.lower()] if word.lower() in abbreviations else word

# Replace all abbreviations
def replace_abbrev(text):
    result = ""
    for word in text.split():
        result += word_abbrev(word) + " "
    return result

text = replace_abbrev(text)
text
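
As a quick check, the 'thx b4n' line from the sample text expands as expected:

print(replace_abbrev("thx b4n"))   # 'thank you bye for now '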

7. Tokenization

Up to this point we have been dealing with input in the form of sentences and paragraphs. A model will not accept data in this form, so we need to break it into smaller units.

Tokenization is the process of breaking down a text or a sentence into smaller units called tokens. These tokens could be words, phrases, symbols, or other meaningful elements depending on the context and the specific requirements of the task.

In natural language processing (NLP), tokenization is a fundamental preprocessing step before any further analysis or processing. It enables machines to understand and process human language by converting raw text into a format that can be easily manipulated and analyzed computationally.

import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')  # tokenizer models used by NLTK's tokenizers

def word_tokenize(text_ip):
    # Split the text into word-level tokens
    wr_tk = nltk.word_tokenize(text_ip)
    return wr_tk

text = word_tokenize(text)
text
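
The import above also brings in sent_tokenize; if sentence-level units suit your task better, a similar sketch (my addition, not used in the rest of this pipeline) would be:

# Sentence-level tokenization; this also relies on the downloaded 'punkt' models
sentences = sent_tokenize("Road trips are a thrilling way of discovering new destinations. Venkat quit his job.")
print(sentences)
# ['Road trips are a thrilling way of discovering new destinations.', 'Venkat quit his job.']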

The list consists of 302 tokens; let's reduce the number further.

8. Removal of stopwords

Stopwords are common words in a language that are often filtered out before or during natural language processing (NLP) tasks because they typically do not carry significant meaning or contribute much to the analysis. These words are so commonly used that they appear frequently in any given text but often do not add much value to the understanding of the content.

Stopwords can include words like “the,” “and,” “is,” “in,” “to,” “of,” “a,” “an,” “that,” “it,” “for,” “with,” “on,” “at,” “by,” “as,” “are,” “was,” “were,” “be,” “been,” “being,” and so on, depending on the language.

Removing stopwords from text data can help reduce the dimensionality of the data, speed up processing, and improve the performance of certain NLP tasks such as text classification, sentiment analysis, and information retrieval. However, the list of stopwords can vary depending on the specific task and domain, and it’s often customizable based on the needs of the analysis.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def remove_stopword(text_ip):
    # Drop tokens that appear in NLTK's English stopword list
    remove_stopword = [word for word in text_ip if word.lower() not in stopwords.words('english')]
    return remove_stopword

text = remove_stopword(text)
text
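
One practical aside of mine: stopwords.words('english') is rebuilt on every iteration of the comprehension above, so caching it in a set once makes the filter noticeably faster on long token lists:

stop_set = set(stopwords.words('english'))

def remove_stopword_fast(tokens):
    # Same filtering logic, but each membership test is a single set lookup
    return [word for word in tokens if word.lower() not in stop_set]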

The dimension is reduced from 302 to 193 tokens by removing stopwords.

9. Removal of punctuation


By importing the string library we can get the list of punctuation characters via string.punctuation. Again, these punctuation marks do not add any meaning to the list of tokens.

import string

def remove_punctuation(text_ip):
    # Strip punctuation characters from each token
    wr_tk_punct = [''.join(char for char in item
                           if char not in string.punctuation)
                   for item in text_ip]
    return wr_tk_punct

text = remove_punctuation(text)
text
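
An equivalent and usually faster variant, added here as an alternative sketch, builds a translation table from string.punctuation and uses str.translate:

punct_table = str.maketrans('', '', string.punctuation)

def remove_punctuation_translate(tokens):
    # Delete every character listed in string.punctuation from each token
    return [token.translate(punct_table) for token in tokens]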

10. Removal of empty strings

Removal of punctuation leaves behind empty strings, so these need to be removed.

def remove_empty_string(text):
    # Keep only non-empty tokens
    non_empty_str = [i for i in text if i]
    return non_empty_str

text = remove_empty_string(text)
text

Now you can observe a further reduction in the number of tokens.

11. Dealing with numbers

You can either remove the numbers or convert them to words. I prefer converting them to words.

from num2words import num2words

def num_word(text_ip):
    # Convert purely numeric tokens to their word form
    text_ip = [num2words(int(word)) if word.isdigit() else word for word in text_ip]
    return text_ip

text = num_word(text)
text
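
For instance, the token '20' from the heading of the sample text becomes 'twenty' (assuming num2words is installed):

print(num2words(20))                       # 'twenty'
print(num_word(['Top', '20', 'Travel']))   # ['Top', 'twenty', 'Travel']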


You can observe that 20 got replaced by the word twenty.

12. SpellChecker

Let's introduce a typo into the token list and let the spell checker correct it.

from spellchecker import SpellChecker

spell = SpellChecker()

def spell_check(text_ip):
    text_op = []
    for item in text_ip:
        typo_cor = spell.correction(item)
        if typo_cor is None:  # spell correction returns None for names, places, etc., so retain the original token
            text_op.append(item)
        else:
            text_op.append(typo_cor)
    return text_op

text = spell_check(text)
text
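
For a rough sense of its behaviour (pyspellchecker's suggestions can vary with its dictionary version), a misspelling like 'travl' is corrected, while out-of-vocabulary proper nouns may return None and are kept as-is by the function above:

print(spell.correction("travl"))            # likely 'travel'
print(spell_check(["travl", "Anuradha"]))   # e.g. ['travel', 'Anuradha']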

Catch you in the next blog, on lemmatization and stemming. In the meantime, please take a look at my previous blog, Understanding Ambiguities in Natural Language Processing.

EndNote:

Thanks for reading the blog; I hope you enjoyed working on text preprocessing. Have thoughts or questions? I'd love to hear from you! Feel free to leave a comment below.

I would love to catch you on LinkedIn. Mail me here for any queries.

Stay tuned for more exciting content. Till then, happy reading!

I believe in the power of continuous learning and sharing knowledge with the community. Your contributions are invaluable in helping me create meaningful content and resources that benefit everyone. Join me on this journey of exploration and innovation in the fascinating world of data science by donating to Buy Me a Coffee.
