MSc Data Science in the UK — Applied NLP — Week 5

Matt Chang
5 min read · Jul 17, 2024


Hi everyone, welcome to my week 5 Master’s in the UK session. (I know it’s been a while since the last update; I will try my best to update the article every day from now on.)

We’ve looked at some techniques that help us piece an NLP project together. If you haven’t read that post yet, refer to the link.

This week, we will look into the details of documents and text preprocessing.

Hope you guys enjoy it, and let’s dive in.

Photo by Mahdis Mousavi on Unsplash

Introduction to Document Retrieval Scenario

The goal of document retrieval is to efficiently find and retrieve all documents containing specific words or phrases from a large digital collection (corpus). This task involves representing and processing text in a way that allows for quick and accurate searches.

Below are some techniques that have been applied in the industry:

  1. Naive String Matching Algorithms: Basic algorithms that search for exact matches of a given word or phrase within a document. While simple, these can be inefficient and often yield unexpected results due to their lack of understanding of word boundaries and variations.
  2. Inverted Indexes: Advanced data structures that map words to the documents in which they appear. This significantly speeds up the search process, as it allows for quick lookups of documents containing specific terms.
  3. Term Frequency-Inverse Document Frequency (TF-IDF): A statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). TF-IDF helps in identifying the most relevant documents for a given query by weighing the frequency of terms against their overall occurrence in the corpus. (A minimal sketch combining an inverted index with TF-IDF follows this list.)
  4. Retrieval-Augmented Generation (RAG): A cutting-edge approach that combines document retrieval with text generation. RAG leverages pre-trained language models like BERT or GPT to retrieve relevant documents and then generates a coherent response based on the retrieved information. This method enhances the quality and relevance of the generated text by incorporating specific and contextually appropriate information from the retrieved documents.
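
To make techniques 2 and 3 more concrete, here is a minimal sketch in plain Python (no external libraries) that builds an inverted index over a tiny toy corpus and ranks documents for a query with TF-IDF. The corpus, the whitespace tokenization, and the log-scaled IDF are my own illustrative choices rather than a prescribed implementation.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: document id -> text (illustrative only)
corpus = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "dogs and cats make good pets",
}

# Naive whitespace tokenization, enough for this sketch
tokenized = {doc_id: text.lower().split() for doc_id, text in corpus.items()}

# Inverted index: term -> set of document ids that contain it
inverted_index = defaultdict(set)
for doc_id, tokens in tokenized.items():
    for token in tokens:
        inverted_index[token].add(doc_id)

def tf_idf(term, doc_id):
    """TF-IDF weight of a term in one document (log-scaled IDF)."""
    tokens = tokenized[doc_id]
    tf = Counter(tokens)[term] / len(tokens)            # term frequency
    df = len(inverted_index.get(term, set()))           # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0     # inverse document frequency
    return tf * idf

def search(query):
    """Use the index to find candidate documents, then rank them by TF-IDF."""
    terms = query.lower().split()
    candidates = set().union(*(inverted_index.get(t, set()) for t in terms))
    scores = {d: sum(tf_idf(t, d) for t in terms) for d in candidates}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("cat"))   # documents containing "cat", ranked by score
```

Note that the query "cat" does not match "cats" in document 2; dealing with that kind of variation is exactly what the normalization, stemming, and lemmatization steps later in this post are for.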

Segmentation and Tokenization

Segmentation and tokenization involve breaking down a corpus into smaller, meaningful units such as documents, paragraphs, sentences, words, morphemes, and characters. This step is essential for further processing and analysis of text data.

Sentence Segmentation:

  1. Rule-Based Methods: These methods use predefined rules, such as punctuation marks (e.g., periods, question marks) to identify sentence boundaries. However, they often struggle with ambiguities like abbreviations and titles.
  2. Machine Learning-Based Methods: Modern approaches involve training binary classifiers to identify sentence boundaries based on features like punctuation and capitalization. These methods can handle more complex cases and are generally more accurate.
  3. Python NLTK Punkt Sentence Segmenter: This is a popular tool for sentence segmentation. It uses an unsupervised algorithm to detect sentence boundaries based on statistical models of punctuation and capitalization.
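
As a quick illustration of point 3, here is a small sketch using NLTK's Punkt sentence segmenter. The example text is my own, and the nltk.download calls are one-off model downloads (newer NLTK releases use the "punkt_tab" resource name).

```python
import nltk
from nltk.tokenize import sent_tokenize

# One-off downloads of the pre-trained Punkt models
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)   # resource name used by newer NLTK releases

text = ("Dr. Smith studied at St. Andrews. He moved to the U.K. in 1999. "
        "Did the course help? Absolutely!")

# Punkt is trained to recognise abbreviations such as "Dr." and "St.",
# so it should not split the text at those periods.
for sentence in sent_tokenize(text):
    print(sentence)
```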

Tokenization:

  1. Python split() Method: The built-in string method str.split() can be used for simple tokenization tasks where words are separated by whitespace.
  2. Regular Expressions: More complex tokenization tasks can be handled using regular expressions, which allow for more precise control over the splitting process.
  3. NLTK Regular Expression Tokenizer: This tokenizer from the NLTK library provides robust tokenization capabilities, allowing for customized token patterns.
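
The sketch below contrasts the three tokenization options above on the same sentence; the example string and the regular expression patterns are illustrative choices.

```python
import re
from nltk.tokenize import RegexpTokenizer

text = "Mr. O'Neill doesn't like low-cost tickets, does he?"

# 1. Plain split(): fast, but punctuation stays attached to the words
print(text.split())
# ['Mr.', "O'Neill", "doesn't", 'like', 'low-cost', 'tickets,', 'does', 'he?']

# 2. Regular expressions: keep only runs of word characters, dropping punctuation
print(re.findall(r"\w+", text))
# ['Mr', 'O', 'Neill', 'doesn', 't', 'like', 'low', 'cost', 'tickets', 'does', 'he']

# 3. NLTK RegexpTokenizer: a custom pattern that keeps internal apostrophes and
#    hyphens inside a token and emits punctuation marks as separate tokens
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)*|\S")
print(tokenizer.tokenize(text))
# ['Mr', '.', "O'Neill", "doesn't", 'like', 'low-cost', 'tickets', ',', 'does', 'he', '?']
```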

Next, we will look at an interesting topic: How Many Words Are There?

Understanding the number and types of words in a text is crucial for various NLP tasks. This involves counting tokens and distinguishing between types (unique words) and tokens (total word occurrences).

  1. Tokenization: The process of segmenting text into individual words (tokens), which can then be counted.
  2. Herdan-Heaps Law: This law describes how the number of unique word types grows sublinearly with the size of a corpus, roughly as V ≈ kN^β (with β typically between 0.4 and 0.6 for English text). It is used to estimate the vocabulary size of a text.
  3. Zipf’s Law: This statistical principle states that the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. This helps in understanding the distribution of word frequencies and identifying common and rare words.
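
The sketch below counts tokens and types for a tiny piece of text and prints the most frequent words; on a real corpus the frequency table usually shows the Zipf-like pattern of a few very common words and a long tail of rare ones. The text and the whitespace tokenization are illustrative choices.

```python
from collections import Counter

text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the cat watched the dog and the dog watched the cat"
)

tokens = text.lower().split()     # tokens = every running word occurrence
types = set(tokens)               # types  = distinct word forms

print("tokens:", len(tokens))
print("types: ", len(types))
print("type/token ratio:", round(len(types) / len(tokens), 2))

# Frequency table: in large corpora counts drop off roughly as 1/rank (Zipf's Law)
for rank, (word, count) in enumerate(Counter(tokens).most_common(5), start=1):
    print(rank, word, count)
```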

Normalization

Normalization is the process of standardizing text data to ensure consistency and accuracy in analysis. This includes converting text to a consistent format and handling variations in representation.

  1. Case Normalization: Converting all text to lowercase to eliminate discrepancies between uppercase and lowercase letters. This reduces the number of unique word forms.
  2. Number Normalization: Replacing numbers with a generic string (e.g., “NUM”) to standardize numerical data.
  3. Stopword Removal: Removing common words that do not carry significant meaning (e.g., “the,” “is,” “and”) using a predefined stopword list. This simplifies the text and focuses analysis on more meaningful words.
  4. Punctuation Removal: Stripping out punctuation marks to clean the text and reduce noise in the data.
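
Putting the steps above together, a minimal normalization pipeline might look like the sketch below. The regular expression, the "NUM" placeholder, and NLTK's English stopword list are illustrative choices, and the nltk.download call is a one-off.

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)       # one-off download of the stopword lists
STOPWORDS = set(stopwords.words("english"))

def normalize(text):
    text = text.lower()                                   # case normalization
    text = re.sub(r"\d+(?:\.\d+)?", "NUM", text)          # number normalization
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    return [t for t in text.split() if t not in STOPWORDS]  # stopword removal

print(normalize("The 3 cats ate 2.5 tins of tuna, and then they slept!"))
# With NLTK's default English stopword list this prints:
# ['NUM', 'cats', 'ate', 'NUM', 'tins', 'tuna', 'slept']
```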

Stemming and Lemmatization

Stemming and lemmatization are techniques used to reduce words to their base or root form. This helps in grouping different forms of a word together for analysis.

Stemming:

  1. Porter Stemmer (NLTK): A widely used stemming algorithm that applies a series of rules to remove suffixes and reduce words to their base form. While efficient, it can over-stem (conflating unrelated words) or under-stem (missing related forms), and the resulting stem is not always a real word.

Lemmatization:

  1. WordNet Lemmatizer (NLTK): This tool uses a lexicon and morphological analysis to convert words to their base form (lemma). It provides more accurate results than stemming by considering the context and meaning of words.
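
Here is a short sketch comparing the two on a handful of words; the word list is illustrative and the nltk.download calls are one-off resource downloads. Note that the WordNet lemmatizer treats every word as a noun unless you pass a part-of-speech tag.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)    # lexicon used by the lemmatizer (one-off)
nltk.download("omw-1.4", quiet=True)    # extra WordNet data needed by some NLTK versions

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "geese", "running", "better"]:
    print(
        f"{word:10} stem: {stemmer.stem(word):8} "
        f"lemma (noun): {lemmatizer.lemmatize(word):8} "
        f"lemma (verb): {lemmatizer.lemmatize(word, pos='v')}"
    )
```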

Advanced Methods for Text Processing

Modern methods and tools allow for more refined text processing and analysis, improving the efficiency and effectiveness of NLP tasks.

  1. List Comprehensions and Nested List Comprehensions: Efficient ways to process and manipulate lists in Python, reducing the amount of code needed and improving readability.
  2. Random Sampling: Techniques for generating diverse samples of data, ensuring that analysis and model training are representative and unbiased. Python’s random library is commonly used for this purpose.
  3. Shell Commands in Notebooks: Using shell commands within Jupyter Notebooks to manage resources and files, streamline workflows, and enhance productivity. Commands like `!ls` and `!unzip` are examples.
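
To tie these together, here is a small sketch that tokenizes a toy corpus with a nested list comprehension and draws a reproducible random sample of sentences; the corpus and sample size are illustrative. Shell commands such as `!ls` only work inside a notebook cell, so they appear here as comments.

```python
import random

corpus = [
    "The cat sat on the mat.",
    "Dogs chase cats.",
    "Natural language processing is fun.",
    "Sampling keeps experiments manageable.",
]

# Nested list comprehension: one lowercased token list per sentence
tokenized = [[token.lower() for token in sentence.split()] for sentence in corpus]

# Flattening with a nested comprehension (the outer loop comes first)
all_tokens = [token for sentence in tokenized for token in sentence]
print(len(all_tokens), "tokens in total")

# Reproducible random sample of 2 sentences for quick manual inspection
random.seed(42)
print(random.sample(corpus, k=2))

# In a Jupyter Notebook you could manage files with shell commands, e.g.:
# !ls data/
# !unzip -o corpus.zip -d data/
```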

Feel free to drop me a question or comment below.

Cheers, happy learning. I will see you in week 6.

The data journey is not a sprint but a marathon.

Medium: MattYuChang

LinkedIn: matt-chang

Facebook: Taichung English Meetup

(I created this group four years ago for people who want to hone their English skills. Events are held regularly by our awesome hosts every week. Follow the FB group link for more information!)
