NLP 101

Sparsh Goyal
11 min read · Jul 28, 2024


From Basic Preprocessing to the Magic of Attention Mechanisms

Natural Language Processing (NLP) is the technology that enables machines to understand and respond to human language, making interactions with computers and machines more intuitive and conversational.

source: https://www.getyarn.io/yarn-clip/8ea734be-1555-4746-98f3-894c48359d4b/gif

Table of Contents:

  • Intro to NLP
  • Basic Preprocessing
  • Advanced Preprocessing
  • Feature Extraction Techniques
  • Intro to RNNs
  • LSTM
  • GRU
  • Seq-2-Seq Model
  • Attention Mechanism

Intro to NLP

Natural Language Processing (NLP) is the AI technology that enables machines to understand human language, in text or voice form, and to communicate with us in our own natural language.
Imagine teaching your toaster to understand your deepest existential questions. That’s the essence of NLP, minus the existential crisis.

NLP enables computers and digital devices to recognize, understand, and generate text and speech by combining computational linguistics with statistical modeling, machine learning (ML), and deep learning.

The first cornerstone of NLP was laid by Alan Turing in the 1950s, who proposed that if a machine could take part in a conversation with a human, it could be considered a "thinking" machine.

Different Levels of NLP

NLP is a multi-layered process, each level diving deeper into the nuances of language. Think of it like peeling an onion — each layer brings out a new dimension of understanding. Let’s explore these levels:

  • Morphological:
    The morphological level of linguistic processing deals with the study of word structures and word formation, focusing on the analysis of the individual components of words.
  • Lexical:
    The lexicon of a language is its collection of words and phrases. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.
  • Syntactic:
    The next step after lexical analysis, where we try to extract more meaning from the sentence, this time by using its syntax. Instead of looking only at the words, we look at the syntactic structure, i.e., the grammar of the language, to understand the meaning.
  • Semantic:
    This level entails the appropriate interpretation of the meaning of sentences, rather than the analysis at the level of individual words or phrases.
  • Discourse:
    Discourse processing is a suite of NLP tasks that uncover linguistic structure from texts at several levels, which can support many NLP applications. It deals with the structure and meaning of text beyond a single sentence, making connections between words and sentences.
Working of NLP

Use cases of NLP

  • Translation and summarization
  • Chatbots
  • Grammar/spelling check
  • Sentence completion
  • Data analytics

Challenges of NLP

  • Sarcasm
  • Phrase ambiguity
  • Slang or street language
  • Domain-specific language
  • Bias in training data

Basic Preprocessing

Before diving into the complex stuff, let’s start with some basic preprocessing — think of it as a spa day for your text data. This step involves cleaning and preparing the text for further analysis.

1. Tokenization

A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered as discrete elements.
Tokenization turns an unstructured string (a text document) into a sequence of discrete tokens, the structured form that later steps build on to produce numerical representations for machine learning.

Example:
Input: "I love NLP."
Output: ["I", "love", "NLP", "."]

2. Case Folding

Case folding simply refers to lower- or upper-casing all tokens (most commonly lowercasing).

Example:
Input: "The Quick Brown Fox jumps Over the Lazy Dog."
Output: "the quick brown fox jumps over the lazy dog."

3. Stop Words Removal

Removal of words that occur frequently but don't convey much meaning.

Example:
Input: "The quick brown fox jumps over the lazy dog."
Output: "quick brown fox jumps lazy dog."
(Here, common words like "The", "over", and "the" are removed.)
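
A minimal sketch that applies case folding and then removes stop words using NLTK's built-in English stop-word list (assuming nltk and its stopwords corpus are available):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

text = "The quick brown fox jumps over the lazy dog."
stop_words = set(stopwords.words("english"))

tokens = word_tokenize(text.lower())                 # case folding
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# e.g. ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']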

4. Stemming

Refers to stripping suffixes (and sometimes prefixes) from words to reduce them to a crude root form.

Example:
Input: "running runningly runs runners"
Output:"run runnl run runner"

5. Lemmatization

Reducing a word to its dictionary or lemma form.

Example:
Input: "running runningly runs runners"
Output: "run run run runner"

Stemming may result in less readable, non-dictionary forms, while lemmatization returns more meaningful root forms because it also takes into account whether a word is a noun, verb, etc.
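
To see the difference in practice, here is a small sketch comparing NLTK's Porter stemmer with its WordNet lemmatizer (assuming the wordnet corpus has been downloaded; exact outputs depend on the stemmer and lemmatizer you pick):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

words = ["running", "runs", "runners"]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in words])                   # crude suffix stripping
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # dictionary forms, treating words as verbs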

Advanced Preprocessing

Now, let’s get fancy with some advanced preprocessing.

1. Part-of-Speech (POS) Tagging

Labeling words with their grammatical roles.
POS tagging can help discover the intent or action in a sentence.

Example:
Input: "The cat sat on the mat."
Output: [(The, DT), (cat, NN), (sat, VBD), (on, IN), (the, DT), (mat, NN)]
(Here, DT = determiner, NN = noun, VBD = verb (past tense), IN = preposition.)
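
A minimal POS-tagging sketch with NLTK (the tagger resource name and tag set can vary slightly between NLTK versions):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # resource name may differ in newer NLTK releases

print(nltk.pos_tag(word_tokenize("The cat sat on the mat.")))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]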

2. Named Entity Recognition (NER)

Identifying names, dates, locations, and more.
Imagine it as having a personal assistant who tags all the VIPs in your text.
Can be used in categorizing corpus, question answering, etc.

Example:
Input: "Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino."
Output: [(Apple Inc., ORGANIZATION), (Steve Jobs, PERSON), (Steve Wozniak, PERSON), (Cupertino, LOCATION)]
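
A minimal NER sketch with spaCy, assuming the small English model has been installed via python -m spacy download en_core_web_sm (note that spaCy's label names differ slightly from the output above, e.g. ORG instead of ORGANIZATION and GPE instead of LOCATION):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # entity span and its predicted label
# e.g. Apple Inc. ORG / Steve Jobs PERSON / Steve Wozniak PERSON / Cupertino GPE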

3. Parsing

Parsing is determining the syntactic structure of a sentence.
There are two main types of parsing:

  • Constituency Parsing
    Constituency parsing analyzes the sentence structure by breaking it down into nested sub-phrases, or constituents, which can be combined to form the full sentence. It represents sentences in a hierarchical tree structure called a parse tree.

Example:
Input: "The cow jumped over the moon."

source: https://static.packt-cdn.com/products/9781784391799/graphics/3785_07_01.jpg
  • Dependency Parsing
    Dependency parsing focuses on the grammatical relationships between words, showing how each word depends on other words in the sentence. It represents sentences as a directed graph with words as nodes and dependencies as edges.

Example:
Input: "The cat sat on the mat."

Dependency Graph:

  • sat (verb) → cat (noun): nominal subject
  • cat (noun) → The (determiner): determiner of "cat"
  • sat (verb) → on (preposition): prepositional modifier
  • on (preposition) → mat (noun): object of the preposition
  • mat (noun) → the (determiner): determiner of "mat"

Explanation: This graph shows that "sat" is the root verb and "cat" is its subject, with "The" as the determiner of "cat". The preposition "on" connects "sat" to "mat", which takes "the" as its determiner.
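
A minimal dependency-parsing sketch with spaCy (same en_core_web_sm assumption as above); each line prints a word, its dependency label, and the head it depends on:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

for token in doc:
    print(f"{token.text:>4} --{token.dep_}--> {token.head.text}")
# e.g. cat --nsubj--> sat, on --prep--> sat, mat --pobj--> on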

Feature Extraction Techniques: Transforming Text into Numbers

To analyze and model text data, we need to convert it into numerical formats. Here are some common techniques:

1. Bag of Words (BoW)

BoW represents text as a collection of words without considering their order. It focuses on word frequency or binary presence.

  • Binary BoW: Indicates whether a word is present (1) or absent (0) in a document.

Example:
Text: "Cats are great pets."
Vocabulary: ["cats", "are", "great", "pets"]
Binary Representation: [1, 1, 1, 1]

  • Frequency BoW: Counts the number of times each word appears.

Example:
Text: "Cats are great pets. Cats are fun."
Vocabulary: ["cats", "are", "great", "pets", "fun"]
Frequency Representation: [2, 2, 1, 1, 1]
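
Both BoW variants can be produced with scikit-learn's CountVectorizer, as in this minimal sketch (assuming scikit-learn is installed; its default tokenizer lowercases and drops punctuation, so the vocabulary may look slightly different from the hand-built one above):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Cats are great pets. Cats are fun."]

freq_vec = CountVectorizer()                # frequency BoW: word counts
binary_vec = CountVectorizer(binary=True)   # binary BoW: presence/absence

print(freq_vec.fit_transform(corpus).toarray(), freq_vec.get_feature_names_out())
print(binary_vec.fit_transform(corpus).toarray(), binary_vec.get_feature_names_out())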

2. TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF measures how important a word is in a document relative to a collection of documents. It balances word frequency with its importance across multiple documents.

  • Term Frequency (TF): How often a word appears in a document, divided by the document's length.
    Text: "Cats are great pets."
    TF for "cats": 1 / 4 = 0.25 ("cats" appears once in a 4-word document)
  • Inverse Document Frequency (IDF): Measures how common or rare a word is across all documents.
    Corpus: ["Cats are great pets.", "Dogs are great pets too."]
    IDF for "cats": log(2 / 1) ≈ 0.301 ("cats" appears in 1 of the 2 documents)
  • TF-IDF: TF * IDF = 0.25 * 0.301 ≈ 0.075
image from: http://filotechnologia.blogspot.com/2014/01/a-simple-java-class-for-tfidf-scoring.html
source: https://www.youtube.com/watch?v=fIYSi41f1yg&list=PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS&index=6
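
Here is a small sketch that reproduces the numbers above by hand (scikit-learn's TfidfVectorizer would give somewhat different values, since by default it uses the natural log and smoothing):

import math

corpus = ["cats are great pets", "dogs are great pets too"]
doc = corpus[0].split()

tf = doc.count("cats") / len(doc)                # 1 / 4 = 0.25
df = sum("cats" in d.split() for d in corpus)    # "cats" appears in 1 of 2 documents
idf = math.log10(len(corpus) / df)               # log10(2 / 1) ≈ 0.301

print(round(tf * idf, 3))                        # ≈ 0.075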

3. N-grams

N-grams are sequences of n consecutive words or characters. They capture context and word combinations.

Example:

  • Unigrams (1-grams): Individual words.
    Text: "Cats are great pets."
    Unigrams: ["Cats", "are", "great", "pets"]
  • Bigrams (2-grams): Pairs of consecutive words.
    Text: "Cats are great pets."
    Bigrams: ["Cats are", "are great", "great pets"]

4. Word2Vec

Word2Vec learns vector representations of words based on their context in a large corpus. It captures semantic meanings and relationships between words.

source: https://jalammar.github.io/illustrated-word2vec/
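
A minimal Word2Vec sketch with gensim (assuming gensim 4.x is installed; the three-sentence corpus here is only meant to show the API, as useful embeddings require far more text):

from gensim.models import Word2Vec

sentences = [
    ["cats", "are", "great", "pets"],
    ["dogs", "are", "great", "pets", "too"],
    ["cats", "and", "dogs", "are", "fun"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram

print(model.wv["cats"][:5])            # first 5 dimensions of the learned "cats" vector
print(model.wv.most_similar("cats"))   # nearest neighbours in the toy embedding space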

Intro to RNNs: The Memory Masters

Recurrent Neural Networks (RNNs) are like that friend who remembers everything you’ve ever said — sometimes even the embarrassing stuff. They’re designed to handle sequences, making them ideal for tasks like language modeling and time-series prediction.

RNNs work by maintaining a hidden state that carries information from one step to the next.
They apply a recurrence relation at every time step to process a sequence:

source: https://rkenmi.com:8443/rnn-recurrent-neural-networks/
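
The recurrence itself is tiny. Here is a sketch of a single vanilla-RNN step in NumPy, with random weights just to make the shapes concrete:

import numpy as np

hidden_size, input_size = 8, 4
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1   # hidden-to-hidden weights
W_xh = np.random.randn(hidden_size, input_size) * 0.1    # input-to-hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

h = np.zeros(hidden_size)
for x_t in np.random.randn(5, input_size):   # a 5-step input sequence
    h = rnn_step(x_t, h)                     # the hidden state carries information forward
print(h)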

LSTM: The RNN’s Brainy Cousin

Long Short-Term Memory networks (LSTMs) — the RNNs with a Ph.D. in memory management. They’re designed to handle long-range dependencies.
LSTMs use memory cells and three gates — input, forget, and output — to regulate the flow of information. This structure helps LSTMs remember important information for long periods and forget less relevant details.

For diving deeper into LSTMs and how they work, I recommend this blog.

GRU (Gated Recurrent Unit)

GRU networks are a streamlined variant of LSTMs that also capture long-term dependencies in sequences. They employ two gates — reset and update — to manage the flow of information.
GRUs are computationally more efficient than LSTMs, offering faster training while effectively retaining important information over time.

  • Reset Gate: This gate decides how much of the past information to forget. When the reset gate is close to zero, it effectively forgets the past state, focusing more on the current input.
  • Update Gate: This gate determines how much of the past information to retain and how much of the new information to add. It blends the previous hidden state and the current input, enabling the model to decide how much of the past knowledge should carry forward to the future.
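
The gates can be written out directly. Here is a sketch of one GRU step in NumPy, with random weights, just to make the gate equations above concrete:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

d = 6                                                # hidden and input size (kept equal for brevity)
Wz, Uz = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1
Wr, Ur = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1
Wh, Uh = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate: how much past to carry forward
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate: how much past to forget
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate state from current input
    return (1 - z) * h_prev + z * h_tilde            # blend old state and candidate

h = gru_step(np.random.randn(d), np.zeros(d))
print(h)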

Seq-2-Seq

Sequence-to-Sequence (Seq-2-Seq) models are the go-to for tasks that involve converting one sequence to another, like translation or summarization.
They use an encoder-decoder architecture, where the encoder processes the input sequence, and the decoder generates the output sequence.

Encoder

  • The encoder processes the input sequence and passes the final state of its recurrent layer to the decoder as its initial state.
  • It reads the input sequence and summarizes the information into internal state vectors (context vector), such as hidden and cell state vectors in LSTM. We discard the encoder’s outputs and keep only the internal states to help the decoder make accurate predictions.

Decoder

  • The decoder uses the last state of the encoder’s recurrent layer as the initial state for its first recurrent layer and processes the desired output sequences.
  • The encoder’s final context vector is input to the first cell of the decoder. Using these initial states, the decoder generates the output sequence, incorporating previous outputs to inform future predictions.
source: https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346
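
A minimal encoder-decoder sketch in PyTorch (assuming torch is installed; the vocabulary sizes, the choice of GRU cells, and the random inputs are illustrative assumptions, not a full training setup):

import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 100, 120, 32, 64    # illustrative sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src):                  # src: (batch, src_len) of token ids
        _, h = self.rnn(self.emb(src))       # keep only the final hidden state (the context vector)
        return h

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, tgt, h):               # tgt: previously generated (or teacher-forced) tokens
        o, h = self.rnn(self.emb(tgt), h)    # initial state = encoder's context vector
        return self.out(o), h

src = torch.randint(0, SRC_VOCAB, (2, 7))    # a fake batch of 2 source sentences
context = Encoder()(src)
logits, _ = Decoder()(torch.randint(0, TGT_VOCAB, (2, 5)), context)
print(logits.shape)                          # (2, 5, TGT_VOCAB)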

Attention Mechanism: The VIP Treatment

Finally, let’s talk about attention mechanisms — these are like giving VIP treatment to important parts of your input sequence. Instead of processing everything equally, attention mechanisms let the model focus on the most relevant parts of the text.
Before going deep into attention, let’s first talk about the information bottleneck.

The Information Bottleneck

The encoder is "forced" to compress the whole input into a single vector, regardless of the length of our input, i.e., how many words make up our sentence.

Even if we use a large number of hidden units in the encoder to capture a larger context, the model then tends to overfit on short sequences, and we take a performance hit as the number of parameters grows.

Attention can be used to overcome this information bottleneck.

An attention mechanism is a part of a neural network. At each decoder step, it decides which source parts are more important. In this setting, the encoder does not have to compress the whole source into a single vector — it gives representations for all source tokens (for example, all RNN states instead of the last one).

Attention: At different steps, let a model “focus” on different parts of the input.

source: https://www.youtube.com/watch?v=tvIzBouq6lk&t=288s
  1. Score Calculation: For each output token, compute a score for each input token based on their relevance. Common scoring functions include dot product, additive, and scaled dot product.
  2. Softmax Normalization: Apply a softmax function to the scores to obtain attention weights, which are probabilities that sum to one.
  3. Context Vector Creation: Multiply each input token's representation by its corresponding attention weight and sum these weighted representations to create a context vector.
  4. Combining Context and Output State: Combine the context vector with the decoder's current state to produce the final output.
  5. Output Generation: Use the combined vector to generate the final output token.

This step-by-step process allows the model to focus on different parts of the input sequence for each output token, improving the handling of long sequences and dependencies.
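
To make these steps concrete, here is a sketch of scaled dot-product attention for a single decoder step in NumPy (random vectors stand in for a real decoder state and encoder outputs):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
encoder_states = np.random.randn(5, d)       # one representation per source token
decoder_state = np.random.randn(d)           # current decoder state (the "query")

scores = encoder_states @ decoder_state / np.sqrt(d)    # 1. score each source token (scaled dot product)
weights = softmax(scores)                               # 2. softmax -> attention weights summing to 1
context = weights @ encoder_states                      # 3. weighted sum -> context vector
combined = np.concatenate([context, decoder_state])     # 4. combine context with the decoder state
print(weights, combined.shape)                          # 5. `combined` would feed the output layer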

Additive/Bahdanau Attention

  • Mechanism: Computes a score using a feedforward neural network that combines the decoder’s state and the encoder’s outputs.
source: https://arxiv.org/abs/1409.0473
  • Characteristics: Flexible and effective for various tasks, but computationally more intensive due to the extra neural network layer.

Multiplicative/Luong Attention

  • Mechanism: Computes a score using the dot product between the decoder’s state and the encoder’s outputs.
source: https://arxiv.org/abs/1508.04025
  • Characteristics: Faster and simpler to compute, especially efficient with scaled dot-product attention used in Transformer models.
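
The only real difference between the two variants is the scoring function. A short sketch of both, with random weights and the same shapes as the previous snippet:

import numpy as np

d = 8
s = np.random.randn(d)            # decoder state
h = np.random.randn(d)            # one encoder output

# Luong (multiplicative): a simple dot product between the two vectors
score_dot = s @ h

# Bahdanau (additive): a small feedforward network over the projected pair
W1, W2, v = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d)
score_add = v @ np.tanh(W1 @ s + W2 @ h)

print(score_dot, score_add)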

Conclusion

And there you have it! From basic preprocessing to the intricate attention mechanisms, NLP is a journey full of exciting challenges and breakthroughs. Whether you’re just getting started or looking to deepen your knowledge, remember: in the world of NLP, the only limit is your imagination (and maybe your dataset).


Engage with Me

I would love to hear your thoughts and feedback on this post. If you have any questions or suggestions, feel free to leave a comment below or contact me.

Connect with me on LinkedIn

Happy coding, and may your models always be accurate and your errors easily debugged!
