NLP: Basic Lexical Processing for Text Analysis

NitinKumar Sharma
7 min read · Dec 31, 2023


In the vast landscape of data, text holds a treasure trove of information waiting to be discovered. The journey to unlock this treasure involves three stages of Text Analytics, and at the forefront of this adventure stands the first stage — Lexical Analysis. Let’s embark on this exciting journey and explore the significance of Lexical Analysis in the grand narrative of Text Analytics.

Stage 1: Lexical Analysis — Decoding the Language

In the enchanted world of text, Lexical Analysis serves as the key to deciphering the language’s nuances. It is the foundational stage, laying the groundwork for deeper insights. This stage involves understanding the basic building blocks of language — the words.

Once upon a time, in the kingdom of Data Science, there existed a magical dataset filled with words. This dataset held the key to unraveling the mysteries of language, and our journey began with the desire to comprehend its secrets.

Chapter 1: The Symphony of Words

In this enchanted dataset, words danced gracefully, creating a symphony of language. To understand this symphony, we first embarked on the quest to discover the frequencies of each word — how often they graced the stage. The wise Python sorcerer, known as Counter, helped us unveil the Word Frequencies.

import re
from collections import Counter

# Our magical dataset
text = "The quick brown fox jumps over the lazy dog. The dog barks, and the fox runs away."

# Tokenize the text into lowercase words, stripping punctuation so that
# "dog." and "dog" are counted as the same word
words = re.findall(r"[a-z]+", text.lower())

# Count word frequencies
word_freq = Counter(words)

# The magical reveal
print("Word Frequencies:")
print(word_freq)

The Word Frequencies revealed the heartbeat of our dataset, showcasing which words played the lead roles and which preferred the shadows.

Chapter 2: Tokenization — Unveiling the Actors

As we delved deeper into our dataset, we realized that understanding individual words was not enough. We needed to meet each actor — each word — on a personal level. Thus, we introduced Tokenization, a spell that transformed paragraphs into individual actors — tokens.

import nltk
from nltk.tokenize import word_tokenize

# The 'punkt' tokenizer models are needed once
nltk.download("punkt")

# A scene from our dataset
sentence = "The cat in the hat."

# Tokenize into words
word_tokens = word_tokenize(sentence)

# The actors take the stage
print("Word Tokens:")
print(word_tokens)

Sentence Tokenization:

import nltk
from nltk.tokenize import sent_tokenize

# 'punkt' is required here as well (skip if already downloaded)
nltk.download("punkt")

# Our dataset unfolds with multiple scenes
text = "This is the first sentence. And this is the second one."

# Tokenize into sentences
sentence_tokens = sent_tokenize(text)

# Scenes become acts
print("Sentence Tokens:")
print(sentence_tokens)

Tokenization allowed us to break down the narrative into individual acts and scenes, setting the stage for the next chapter.

Embark on a diverse array of tokenization techniques (the last two are sketched just after this list):

Word Tokenizer: Splits text into individual words.

Sentence Tokenizer: Segments text into distinct sentences.

Tweet Tokenizer: Navigates emojis and hashtags in the realm of social media texts.

Regex Tokenizer: Empowers you to craft custom tokenizers using regex patterns of your choosing.
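To illustrate the last two, here is a minimal sketch using NLTK's TweetTokenizer and RegexpTokenizer. The sample tweet and the regex pattern are only illustrative choices.

from nltk.tokenize import TweetTokenizer, RegexpTokenizer

# A social-media style line (illustrative example)
tweet = "Loving #NLP with @nltk_org 😍 !!!"

# TweetTokenizer keeps hashtags, mentions, and emojis as single tokens
tweet_tokens = TweetTokenizer().tokenize(tweet)
print("Tweet Tokens:")
print(tweet_tokens)

# RegexpTokenizer with a custom pattern: keep only alphabetic runs
regex_tokens = RegexpTokenizer(r"[A-Za-z]+").tokenize(tweet)
print("Regex Tokens:")
print(regex_tokens)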

Chapter 3: Bag-of-Words — Crafting the Script

In the realm of language, the order of words sometimes mattered less than their presence. Thus, we introduced the concept of Bag-of-Words, where each document became a bag filled with the essence of words, their frequencies capturing the script’s soul.

from sklearn.feature_extraction.text import CountVectorizer

# Characters speak in our story
documents = ["I love coding.", "Coding is fun."]

# Craft the script – Bag-of-Words
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# The script unfolds
print("Bag-of-Words Representation:")
print(bow_matrix.toarray())

The Bag-of-Words representation transformed our dataset into a script, laying the foundation for the characters’ dialogues.

Chapter 4: The Essence of Words — Stemming and Lemmatization

As our characters (words) wore various costumes, we realized the need to simplify their identities. Enter Stemming and Lemmatization — the costume designers of our linguistic world.

from nltk.stem import PorterStemmer

# The characters' diverse forms
words = ["run", "running", "ran"]

# The magical transformation – Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]

# The characters' true essence
print("Stemmed Words:")
print(stemmed_words)

Stemming trimmed our characters to their core forms, ensuring consistency across scenes.

Lemmatization:

import nltk
from nltk.stem import WordNetLemmatizer

# The WordNet dictionary is needed once
nltk.download("wordnet")

# Characters continue their tale, each with a part-of-speech hint,
# because WordNet assumes every word is a noun by default
words = [("running", "v"), ("better", "a"), ("is", "v")]

# The refined essence – Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]

# Characters find their true selves
print("Lemmatized Words:")
print(lemmatized_words)  # ['run', 'good', 'be']

Lemmatization took characters a step further, preserving their true identities throughout the narrative.

Chapter 5: The Grand Finale — TF-IDF Representation

In the grand finale, we introduced TF-IDF — the critic that emphasized the importance of words within each document. It was time to let the words shine in the spotlight while keeping the story cohesive.

from sklearn.feature_extraction.text import TfidfVectorizer

# The climax of our story
documents = ["I love coding.", "Coding is fun."]

# The crescendo – TF-IDF Representation
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# The masterpiece unfolds
print("TF-IDF Representation:")
print(tfidf_matrix.toarray())
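To make the matrix easier to read, you can line its columns up with the learned vocabulary; the rounding below is only for display.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["I love coding.", "Coding is fun."]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Each column corresponds to one vocabulary term
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))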

TF-IDF added a new dimension, ensuring each character (word) had its moment of glory while contributing to the overall plot.

Let’s solve some questions based on what you have learned above:

1: The frequency of words in any large enough document is best approximated by which distribution?

Options:
A. Gaussian distribution
B. Uniform distribution
C. Zipf distribution
D. Log-normal distribution

Answer: C. Zipf distribution

Explanation: Word frequency is best approximated by the Zipf distribution, which states that the frequency of a word is inversely proportional to its rank.
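To see the idea in miniature, sort the word counts and pair each word with its rank; in a large enough corpus, frequency falls off roughly as 1/rank. The toy text below is only illustrative and far too small to show a clean Zipf curve.

from collections import Counter

text = "the quick brown fox jumps over the lazy dog the dog barks and the fox runs away"
counts = Counter(text.split())

# Sort by descending frequency and print rank, word, frequency
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)

# Zipf's law: the frequency of the word at rank r is roughly proportional to 1/r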

2: Which of the following words is not a stop word in the English language according to NLTK’s English stopwords?

Options:
A. I
B. Has
C. Yes
D. Was

Answer: C. Yes

Explanation: The word ‘yes’ is not in NLTK’s English stopword list, while ‘I’, ‘has’, and ‘was’ all are. Stop words are common words that usually don’t contribute much to the meaning of a sentence.
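You can check this directly against NLTK's English stopword list (the 'stopwords' corpus must be downloaded first):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

english_stopwords = set(stopwords.words("english"))

# 'i', 'has', and 'was' are in the list; 'yes' is not
for word in ["i", "has", "yes", "was"]:
    print(word, word in english_stopwords)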

3: Which of the following words can’t be reduced to its base form by a stemmer?

Options:
A. Bashed
B. Cowardly
C. Worse
D. Sweeping

Answer: C. Worse

Explanation: The base form of the word ‘worse’ is ‘bad’. A stemmer can’t reduce ‘worse’ to ‘bad’, because stemming only strips suffixes and cannot map an irregular form to a different root.
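A quick sketch makes the contrast concrete: the Porter stemmer only strips suffixes from ‘worse’, while WordNet’s lemmatizer, told the word is an adjective, looks it up and returns ‘bad’. The expected outputs are noted in the comments and assume the standard NLTK data is installed.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

print(PorterStemmer().stem("worse"))                    # 'wors' - suffix stripping only
print(WordNetLemmatizer().lemmatize("worse", pos="a"))  # 'bad' - dictionary lookup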

4: What is the rationale behind the concept of inverse document frequency to capture word importance?

Options:
A. If a term appears in many documents, it is considered as important.
B. If a term appears only in a select few documents, it is considered as important.
C. The importance of a term doesn’t depend on its frequency in other documents.
D. None of the above

Answer: B. If a term appears only in a select few documents, it is considered as important.

Explanation: IDF considers the rarity of a term across documents, providing a measure of its significance.
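The intuition can be written as a small calculation using the common formulation idf(t) = log(N / df(t)); scikit-learn’s TfidfVectorizer uses a smoothed variant of this. The third document below is an invented toy example, added so that one term is rarer than the other.

import math

documents = ["I love coding.", "Coding is fun.", "I love tea."]
N = len(documents)

def idf(term):
    # Number of documents containing the term (a case-insensitive
    # substring check is enough for this toy example)
    df = sum(1 for doc in documents if term in doc.lower())
    return math.log(N / df)

print("idf('coding'):", idf("coding"))  # appears in 2 of 3 docs -> lower weight
print("idf('tea'):", idf("tea"))        # appears in 1 of 3 docs -> higher weight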

5: Which of the following words is the correct stemmed version of the word ‘happily’ if you use the NLTK’s Porter stemmer?

Options:
A. Happili
B. Happy
C. Happi
D. None of the above

Answer: A. Happili

Explanation: NLTK’s Porter stemmer produces ‘happili’ as the stemmed version of ‘happily’.
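You can confirm this in a single line:

from nltk.stem import PorterStemmer

print(PorterStemmer().stem("happily"))  # 'happili'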

6: Which of the following statements is incorrect with regards to the bag-of-words model?

Options:
A. The number of rows is equal to the number of documents.
B. The number of columns is equal to the vocabulary size of the text corpus.
C. The value inside a cell corresponding to a document ‘d’ and a term ‘t’ is non-zero if ‘t’ appears one or more times in ‘d’.
D. The number of columns is equal to the total number of words present in the text corpus.

Answer: D. The number of columns is equal to the total number of words present in the text corpus.

Explanation: The number of columns in the bag-of-words model is equal to the vocabulary size, not the total number of words in the corpus.
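You can see this on the earlier two-document example: the matrix has one row per document and one column per vocabulary term, not one column per word occurrence (CountVectorizer’s default tokenizer also drops single-character tokens such as ‘I’).

from sklearn.feature_extraction.text import CountVectorizer

documents = ["I love coding.", "Coding is fun."]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(bow_matrix.shape)                    # (2, 4): 2 documents x 4 vocabulary terms
print(vectorizer.get_feature_names_out())  # ['coding' 'fun' 'is' 'love']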

Epilogue: The Artistry of Language

As our journey through the magical dataset concluded, we marveled at the artistry of language. Each chapter — from understanding frequencies and tokenizing scenes to crafting scripts and refining characters — played a vital role in decoding the language’s magic.

Armed with Python spells and curiosity, anyone could embark on a similar adventure, turning the pages of the language’s enchanting tale. The dataset, once a mysterious collection of words, now stood revealed and understood. And so, our story in the kingdom of Data Science ended, leaving behind a trail of knowledge and the promise of more linguistic adventures to come. Happy coding!
