Natural Language Processing (NLP) Zero to Mastery Part I: Foundations

Unlocking Ten Key Concepts for NLP Proficiency

ChenDataBytes
10 min read · Dec 28, 2023

The articles in this series cover the following topics:

  • Part 1(this article): Presents the fundamental principles of Natural Language Processing (NLP).
  • Part 2: Explores the common applications of NLP.

Natural Language Processing (NLP) is a field of study within computer science and artificial intelligence that focuses on the interaction between computers and human languages. Its objective is to enable computers to understand, interpret, and generate human language, thereby facilitating communication and interaction between humans and machines.

NLP work relies on several widely used libraries, such as NLTK and spaCy. NLTK provides a comprehensive set of tools and resources for NLP tasks, while spaCy is optimized for fast, production-ready processing. That said, spaCy alone does not cover every application; tasks such as sentiment analysis often call for more specialized libraries or approaches. For sequence modelling in NLP, a deep learning framework such as TensorFlow can be used.

Regarding the foundational aspects of NLP, we will delve into ten essential topics: stemming, lemmatization, part-of-speech tagging, stop words, pattern matching, sentence segmentation, named entity recognition, tokenization, word embeddings, and bag-of-words/TF-IDF.

NLP concepts

Linguistic Basics

1. Stemming

Stemming is a linguistic method that obtains the base or root form of a word by removing letters from its end. Its objective is to simplify words by disregarding tense, pluralization, and other grammatical variations. The Porter stemming algorithm applies a collection of predetermined rules and heuristics to strip common English suffixes, converting words into their corresponding stems. spaCy doesn’t have a built-in implementation of the Porter stemmer, so we use NLTK for this exercise.

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ["runner", "running", "ran"]
for word in words:
    print(word + ' --> ' + p_stemmer.stem(word))

2. Lemmatization

In contrast to stemming, lemmatization is a more sophisticated linguistic process that aims to reduce a word to its base or dictionary form, known as a lemma. Lemmatization takes into account factors such as part-of-speech (POS) tags and contextual understanding to ensure accurate and meaningful transformations.

import spacy
nlp = spacy.load('en_core_web_sm')

doc1 = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line with a triumphant smile on their face.")

for token in doc1:
    print(token.text, '\t', token.lemma, '\t', token.lemma_)

3. Part of speech

Part of speech refers to the grammatical category of a word in a sentence, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, or interjection. Part-of-speech tagging supports many downstream tasks, including named entity recognition and speech recognition. Statistical taggers use the probabilities of tags occurring near one another to choose the most likely tag for each word.

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line with a triumphant smile on their face.")
# print the part of speech for "after"
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))

Text Identification/ Extraction

4. Stop words

Common words such as “a” and “the” occur so frequently in text that they often carry little meaning compared to nouns, verbs, and modifiers. These commonly occurring words are referred to as stop words and can be filtered out during text processing. spaCy provides a built-in list of over 300 English stop words that can be readily utilized.

import spacy
nlp = spacy.load('en_core_web_sm')

sentences = "The dedicated runner, after running for hours, finally ran across the finish line with a triumphant smile on their face."

def remove_stopwords(sentence):
    sentence = sentence.lower()
    words = sentence.split()
    sentence = " ".join([w for w in words if not nlp.vocab[w].is_stop])
    return sentence

remove_stopwords(sentences)

5. Pattern Matching

Pattern matching entails the identification and extraction of linguistic patterns or structural information from text. This process involves searching for specific sequences of words, phrases, or syntactic structures that conform to predefined patterns or rules.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line with a triumphant smile on their face.')

matcher = PhraseMatcher(nlp.vocab)

phrase_list = ['runner', 'smile']
phrase_patterns = [nlp(text) for text in phrase_list]
matcher.add('newproduct', phrase_patterns)  # spaCy v3 API: pass the list of patterns directly

matches = matcher(doc)

#Print the matches found in the text.
#Each match is represented as a tuple containing the match ID, start index, and end index in the text.
print(matches)
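
Beyond literal phrases, spaCy also offers a rule-based Matcher that matches on token attributes such as lemmas or part-of-speech tags. The following is a minimal sketch added for illustration (not part of the original example); it reuses the doc defined above and matches every inflection of the verb "run":

from spacy.matcher import Matcher

token_matcher = Matcher(nlp.vocab)
# A single rule: any token whose lemma is "run" (matches "running" and "ran" in doc)
token_matcher.add('RUN_FORMS', [[{'LEMMA': 'run'}]])

for match_id, start, end in token_matcher(doc):
    print(nlp.vocab.strings[match_id], start, end, doc[start:end].text)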

6. Sentence Segmentation

Sentence segmentation refers to the process of dividing a document or a textual piece into individual sentences. In natural language processing, accurately identifying sentence boundaries is crucial for various text analysis and language understanding tasks.

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc.sents:
    print(sent)

7. Named Entity Recognition (NER)

Named Entity Recognition (NER) entails the identification and classification of named entities present in the given text. Named entities typically encompass distinct types of words or phrases that represent recognizable entities, including but not limited to names of individuals, organizations, locations, dates, numerical expressions, and others.

import spacy
nlp = spacy.load('en_core_web_sm')

# Use a sentence that contains named entities (the runner sentence has none)
doc = nlp(u'Apple is looking at buying a U.K. startup for $1 billion.')
for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

Text Representation

8. Tokenization

Tokenization involves breaking down the original text into smaller components known as tokens. These tokens can be created from contiguous sequences of characters or words. The example below demonstrates tokenization with spaCy.

import spacy
nlp = spacy.load('en_core_web_sm')
mystring = '"The dedicated runner, after running for hours, finally ran across the finish line with a triumphant smile on their face."'
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

# Counting tokens in the Doc
print("\nToken count:", len(doc))

# Counting vocab entries (lexemes in the shared vocabulary, not just this Doc)
print("\nVocab entries:", len(doc.vocab))

N-gram refers to a consecutive sequence of n items, where an item can be a character or a word.

from nltk import bigrams
text = """The dedicated runner, after running for hours, finally ran across the finish line with a triumphant smile on their face."""
lines = map(str.split, text.split('\n'))
for line in lines:
    print("\n".join([" ".join(bi) for bi in bigrams(line)]))

The following example illustrates how to generate tokens, sequences, and perform padding using the TensorFlow framework. In NLP model training, it is also common to create input-output pairs, where the input consists of a sequence of words or characters, and the output is the subsequent word or character.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define your input texts
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)
# Get the word index dictionary
word_index = tokenizer.word_index
# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Print the result
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5)
# Print the result
print("\nPadded Sequences:")
print(padded)
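
The input-output pairs mentioned above are easy to derive from these token sequences. Here is a minimal sketch added for illustration (it reuses the sequences list produced by the tokenizer above): each prefix of a sentence becomes the input, and the token that follows it becomes the target.

# Build (context, next-token) training pairs from the sequences above
input_output_pairs = []
for seq in sequences:
    for i in range(1, len(seq)):
        input_output_pairs.append((seq[:i], seq[i]))

# Each pair: the tokens seen so far, and the token the model should predict next
for context, target in input_output_pairs[:5]:
    print(context, '-->', target)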

Subword tokenization is a text tokenization technique that breaks down words into smaller units, known as subwords or subword units. Unlike traditional word-based tokenization, where each word is considered a single token, subword tokenization allows the representation of words as a sequence of subword units.

import tensorflow_datasets as tfds

# Download the subword encoded pretokenized dataset
imdb_subwords, info_subwords = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)
train_data, test_data = imdb_subwords['train'], imdb_subwords['test']

# Get the encoder
tokenizer_subwords = info_subwords.features['text'].encoder

# Define sample sentence
sample_string = 'TensorFlow, from basics to mastery'

# Encode using the subword text encoder
tokenized_string = tokenizer_subwords.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))

# Decode and print the results
original_string = tokenizer_subwords.decode(tokenized_string)
print ('The original string: {}'.format(original_string))

9. Word vectors/word embeddings

Basic word representations can be classified into three categories: integers, one-hot vectors, and word embeddings. Word embedding is a technique that represents words as dense, low-dimensional vectors in a continuous vector space. The main objective of word embeddings is to capture the semantic and contextual relationships between words. For instance, when visualizing word embeddings in 2D, similar words tend to be located close to each other.
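
To see this in practice, here is a minimal sketch (added for illustration) that compares similarities between spaCy token vectors. It assumes the medium English model en_core_web_md, which ships with static word vectors, has been downloaded (python -m spacy download en_core_web_md).

import spacy

# en_core_web_md includes static word vectors; the small model does not
nlp = spacy.load('en_core_web_md')

tokens = nlp(u'runner sprinter banana')

# Related words (runner, sprinter) should score higher than unrelated ones (banana)
for t1 in tokens:
    for t2 in tokens:
        print(t1.text, t2.text, round(t1.similarity(t2), 2))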

Word Embedding Methods:

  • Continuous bag-of-words (CBOW): the model learns to predict the center word given some context words.
  • Continuous skip-gram / Skip-gram with negative sampling (SGNS): the model learns to predict the words surrounding a given input word.
  • word2vec (Google, 2013): overcomes the limitations of BoW and TF-IDF by preserving contextual information and representing words in a dense vector space. It does not handle out-of-vocabulary (OOV) words well.
  • Global Vectors (GloVe) (Stanford, 2014): learns word vectors by factorizing the logarithm of the corpus’s word co-occurrence matrix.
  • Deep learning-based contextual embeddings include BERT and GPT.

The code below demonstrates two ways of adding an embedding layer to a TensorFlow model. The first method uses Keras’s built-in Embedding layer. The second uses TensorFlow Hub to build a model with the Universal Sentence Encoder as a pre-trained embedding layer; setting trainable=True allows its parameters to be fine-tuned during training.

import tensorflow_hub as hub
import tensorflow as tf

# embedding method 1: a trainable Embedding layer
# (example hyperparameters; adjust them to your vocabulary and sequence length)
num_words = 10000      # input_dim: size of the vocabulary, i.e. maximum integer index + 1
embedding_dim = 16     # output_dim: dimension of the dense embedding
maxlen = 120           # input_length: length of the (padded) input sequences

model = tf.keras.Sequential([
    # Input: 2D tensor of shape (batch_size, input_length)
    # Output: 3D tensor of shape (batch_size, input_length, output_dim)
    tf.keras.layers.Embedding(input_dim=num_words, output_dim=embedding_dim, input_length=maxlen),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(265, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # a single-unit binary output needs sigmoid, not softmax
])

# embedding method 2: a pre-trained sentence encoder from TensorFlow Hub
model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                   trainable=True, dtype=tf.string, input_shape=[]),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

10. Bag-of-Words/TF-IDF

The bag-of-words approach represents text as an unordered collection, or “bag”, of individual words or tokens, disregarding their specific order or sequence. It creates a numerical representation of a document or corpus by tallying the occurrences of each word; the original word order is discarded, and only word frequency is retained.
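
As a minimal sketch (added for illustration), scikit-learn’s CountVectorizer produces exactly this kind of word-count representation:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I enjoy running",
    "I like to run in the morning"
]

# Each row is a document, each column a vocabulary word; values are raw counts
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())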

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme applied to the bag-of-words representation. Its purpose is to assign weights to words that reflect their importance within a document in the context of a larger collection of documents, known as a corpus.

TF-IDF takes into account two key factors:

1. Term Frequency (TF): This measures how frequently a term (word) appears in a document. It assigns a higher weight to words that occur more frequently within the document. The formula for TF is:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

2. Inverse Document Frequency (IDF): This part measures how unique or rare a term is across all documents. It assigns a higher weight to words that appear less frequently across the corpus but provide more unique or informative content.

IDF(t, D) = log(N / (1 + number of documents containing term t)), where N is the total number of documents in the corpus D; the +1 in the denominator avoids division by zero for terms that appear in no document.

3. TF-IDF Calculation: the final score is obtained by multiplying TF and IDF:

TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)

If the TF-IDF value is high, it means the term is both common in the document and rare across the entire corpus, making it a distinctive feature of that document.
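
As a quick worked example of the formulas above (using the natural logarithm here; the base is a convention and varies between implementations), suppose a term appears once in a 10-word document and occurs in only 1 of the 3 documents in the corpus:

import math

tf = 1 / 10                  # term frequency: 1 occurrence out of 10 terms
idf = math.log(3 / (1 + 1))  # IDF with the +1 in the denominator: log(3/2) ≈ 0.405
print(round(tf * idf, 3))    # TF-IDF ≈ 0.041

The scikit-learn example below computes TF-IDF for a small corpus; note that TfidfVectorizer uses a smoothed IDF and L2 normalization, so its values differ slightly from the raw formula.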

from sklearn.feature_extraction.text import TfidfVectorizer


documents = [
    "The dedicated runner, after running for hours, finally ran across the finish line.",
    "I enjoy running",
    "I like to run in the morning"
]

vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(documents)

feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix
for i in range(len(documents)):
    print("Document:", i + 1)
    for j, feature in enumerate(feature_names):
        tfidf_value = tfidf_vectors[i, j]
        if tfidf_value != 0:
            print(feature, ":", tfidf_value)

End note:

In summary, this NLP primer covered ten basic concepts, from tokenization and word embeddings to part-of-speech tagging and named entity recognition. This foundational understanding sets the stage for Part II, where we’ll explore practical NLP applications across diverse domains.
