Natural Language Processing (NLP) Zero to Mastery Part II: Common Applications
Sentiment analysis, topic modelling, text classification and text generation
The articles in this series cover the following topics:
- Part 1: Presents the fundamental principles and algorithms of Natural Language Processing (NLP).
- Part 2 (this article): Explores the common applications of NLP.
In this article, we explore various commonly encountered use cases in Natural Language Processing (NLP).
Application 1: Sentiment Analysis
Sentiment analysis is the use of natural language processing to systematically identify, extract, quantify, and study affective states and subjective information. Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic, or the overall contextual polarity or emotional reaction to a document, interaction, or event.
We apply NLTK's SentimentIntensityAnalyzer (VADER) to a dataset of 10,000 Amazon reviews. Like our movie review datasets, each review is labelled as either “pos” or “neg”.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np
import pandas as pd
import nltk
nltk.downloader.download('vader_lexicon')
df = pd.read_csv('./Data/amazonreviews.tsv', sep='\t')
sid = SentimentIntensityAnalyzer()
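# polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
# Classify on the compound score, which ranges from -1 to 1
df['compound'] = df['scores'].apply(lambda scores: scores['compound'])
df['sentiment'] = df['compound'].apply(lambda c: 'Positive' if c >= 0.05 else ('Neutral' if c >= -0.05 else 'Negative'))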
df
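Because the reviews carry “pos”/“neg” labels, one way to sanity-check the scores is to collapse the compound score into two classes and measure agreement with the labels. The following is a minimal sketch of that idea in plain Python. The function names, the example scores and the example labels are hypothetical and do not come from the Amazon dataset.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a three-way label (same rule as above)."""
    if compound >= threshold:
        return "Positive"
    if compound >= -threshold:
        return "Neutral"
    return "Negative"


def binary_accuracy(compounds, true_labels):
    """Collapse scores to 'pos'/'neg' (a score >= 0 counts as 'pos') and compare."""
    predicted = ["pos" if c >= 0 else "neg" for c in compounds]
    matches = sum(p == t for p, t in zip(predicted, true_labels))
    return matches / len(true_labels)


# Hypothetical scores and labels, for illustration only.
example_compounds = [0.92, -0.61, 0.03, -0.20]
example_labels = ["pos", "neg", "neg", "neg"]

three_way = [label_from_compound(c) for c in example_compounds]
accuracy = binary_accuracy(example_compounds, example_labels)
print(three_way)
print(accuracy)
```

With the real data, the compound scores and the dataset's label column would take the place of the two example lists.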
Application 2: Topic modelling
Next, we apply topic modelling to news articles. Topic modelling provides a way to automatically organize and understand large collections of text data by uncovering the latent themes or topics within them. When using Latent Dirichlet Allocation or Non-negative Matrix Factorization, you need to choose the number of topics (n_components) in advance.
Latent Dirichlet Allocation with CountVectorizer
Latent Dirichlet Allocation (LDA) is a probabilistic generative model used for topic modelling in NLP. Let’s use NPR news articles to uncover latent topics. The output of LDA is a set of topics, each represented by a probability distribution over words.
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
npr = pd.read_csv('./Data/npr.csv')
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(npr['Article'])
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)
topic_results = LDA.transform(dtm)
npr['Topic'] = topic_results.argmax(axis=1)
npr
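The topic numbers assigned above are easier to interpret once each topic is summarised by its highest-weighted words. The sketch below shows one way to do that with NumPy, assuming a topic-word weight matrix shaped like a fitted model's components_ attribute (one row per topic, one column per vocabulary word). The helper name, the toy matrix and the toy vocabulary are hypothetical stand-ins, not values from the NPR data.

```python
import numpy as np


def top_words_per_topic(components, feature_names, n_top=3):
    """Return the n_top highest-weighted words for each topic row."""
    topics = []
    for weights in components:
        # argsort is ascending, so reverse it and keep the first n_top indices
        top_idx = np.argsort(weights)[::-1][:n_top]
        topics.append([feature_names[i] for i in top_idx])
    return topics


# Hypothetical 2-topic, 5-word matrix standing in for a fitted components_
toy_components = np.array([
    [9.0, 0.5, 7.0, 0.1, 3.0],
    [0.2, 8.0, 0.3, 6.0, 1.0],
])
toy_vocab = ["election", "music", "senate", "album", "year"]

toy_topics = top_words_per_topic(toy_components, toy_vocab)
print(toy_topics)
```

With the fitted model above, the inputs would presumably be LDA.components_ and cv.get_feature_names_out() (named get_feature_names in older scikit-learn releases).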
Non-negative Matrix Factorization with TF-IDF
Non-negative Matrix Factorization (NMF) is a dimensionality reduction technique suitable for non-negative data like text, images, and audio signals. NMF decomposes an input matrix into two non-negative matrices: a basis matrix and a coefficient matrix. The objective is to minimize the reconstruction error by finding a low-rank approximation of the original matrix. We use the same NPR dataset to perform topic modelling.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
npr = pd.read_csv('./Data/npr.csv')
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = tfidf.fit_transform(npr['Article'])
nmf_model = NMF(n_components=7,random_state=42)
nmf_model.fit(dtm)
topic_results = nmf_model.transform(dtm)
npr['Topic'] = topic_results.argmax(axis=1)
npr
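To make the decomposition concrete, here is a small NumPy sketch of NMF using the classic multiplicative update rules, which keep both factors non-negative while reducing the reconstruction error. This is an illustration under stated assumptions, not a description of how scikit-learn solves the problem (its NMF class has its own solvers). The function name and the toy document-term matrix are hypothetical.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, seed=42, eps=1e-10):
    """Approximate V (docs x words) as W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, n_components)) + eps   # document-topic weights
    H = rng.random((n_components, n_cols)) + eps   # topic-word weights
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates: ratios of non-negative terms stay non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))   # Frobenius reconstruction error
    return W, H, errors


# Hypothetical 4-document, 4-word matrix with two visible word groups
toy_dtm = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

W, H, errors = nmf_multiplicative(toy_dtm, n_components=2)
doc_topics = W.argmax(axis=1)   # same argmax step as in the article's code
print(doc_topics)
print(errors[0], errors[-1])
```

In the article's code, W corresponds to the output of nmf_model.transform(dtm) and H to nmf_model.components_.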
Application 3: Text classifier
Recurrent Neural Networks (RNNs) are widely used in Natural Language Processing (NLP) tasks for capturing sequential information in text data. RNNs maintain a hidden state that gets updated at each time step, allowing them to remember information from previous elements in the sequence. Plain RNNs, however, struggle with long-term dependencies because of the “vanishing gradient” problem, so the following variants were developed:
- LSTM incorporates gating mechanisms that control information flow within the network.
- GRU is a simpler variant of LSTM with fewer gates and parameters. It usually trains faster and suits applications where speed matters and a small loss in accuracy is acceptable.
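The hidden-state update described above can be sketched in a few lines of NumPy. This is a minimal forward pass of a plain RNN with a tanh activation and randomly initialised weights. The function name, dimensions and weight scales are assumptions chosen for illustration, and no training is involved.

```python
import numpy as np


def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain RNN over a sequence and return the hidden state at every step."""
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)


rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4            # hypothetical sizes
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)
```

During training, the gradient flowing back through this loop is multiplied by W_h and by the tanh derivative (at most 1) at every step, which is why it tends to shrink over long sequences. The gates in LSTM and GRU are designed to ease that.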
We present an illustrative example of text classification with TensorFlow: a bidirectional LSTM with GloVe word embeddings that predicts the polarity of a tweet. The tweet dataset is provided in a CSV file in which each row holds comma-separated values. The code below uses the first value (the polarity label) and the sixth value (the tweet text).
import csv
import random
SENTIMENT_CSV = "./Data/training_cleaned.csv"
sentences = []
labels = []
with open(SENTIMENT_CSV, 'r', encoding='utf-8', errors='replace') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        # The text is in the 6th column
        sentences.append(row[5])
        # The labels are originally encoded as strings ('0' representing negative and '4' representing positive)
        labels.append(1 if row[0] == '4' else 0)
sentences_and_labels = list(zip(sentences, labels))
# Perform random sampling
random.seed(42)
sentences_and_labels = random.sample(sentences_and_labels, 16000)
# Unpack back into separate lists
sentences, labels = zip(*sentences_and_labels)
train_size = int(len(sentences) * 0.9)
train_sentences = sentences[:train_size]
train_labels = labels[:train_size]
validation_sentences = sentences[train_size:]
validation_labels = labels[train_size:]
print(f"training example: {train_sentences[0]}\n")
print(f"training label: {train_labels[0]}")
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(oov_token= "<OOV>")
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
VOCAB_SIZE = len(word_index)
train_sequences = tokenizer.texts_to_sequences(train_sentences)
# maxlen: every sequence is padded or truncated to exactly this length
train_pad_trunc_seq = pad_sequences(train_sequences, padding='post',
                                    truncating='post', maxlen=16)
val_sequences = tokenizer.texts_to_sequences(validation_sentences)
val_pad_trunc_seq = pad_sequences(val_sequences, padding='post',
                                  truncating='post', maxlen=16)
train_labels = np.array(train_labels)
val_labels = np.array(validation_labels)
print(f"training example: {train_pad_trunc_seq}\n")
print(f"Padded and truncated training sequences have shape: {train_pad_trunc_seq.shape}\n")
print(f"Padded and truncated validation sequences have shape: {val_pad_trunc_seq.shape}")
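To show what padding='post' and truncating='post' do to a single sequence, here is a plain-Python sketch of the same behaviour. It is an illustration only, not the Keras implementation, and the function name and token ids are hypothetical.

```python
def pad_or_truncate_post(sequence, maxlen, pad_value=0):
    """Cut a sequence at the end if too long, or append pad_value if too short."""
    trimmed = sequence[:maxlen]
    return trimmed + [pad_value] * (maxlen - len(trimmed))


# Hypothetical token id sequences of different lengths
toy_sequences = [[5, 12, 7], [3, 9, 4, 8, 2, 6]]
toy_padded = [pad_or_truncate_post(seq, maxlen=4) for seq in toy_sequences]
print(toy_padded)
```

Index 0 is reserved for padding, and the tokenizer's word indices start at 1. This is why the embedding layer later in the example is sized VOCAB_SIZE+1.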
import pickle
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from scipy.stats import linregress
# Define path to file containing the embeddings
# 100-dimensional GloVe word vectors from Stanford (https://nlp.stanford.edu/projects/glove/)
GLOVE_FILE = './Data/glove.6B.100d.txt'
# Initialize an empty embeddings index dictionary
GLOVE_EMBEDDINGS = {}
# Read file and fill GLOVE_EMBEDDINGS with its contents
# Each line holds a word followed by its 100 vector components
with open(GLOVE_FILE, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        GLOVE_EMBEDDINGS[word] = coefs
# VOCAB_SIZE = the length of tokenizer.word_index
EMBEDDINGS_MATRIX = np.zeros((VOCAB_SIZE+1, 100))
for word, i in word_index.items():
    embedding_vector = GLOVE_EMBEDDINGS.get(word)
    # Words without a GloVe vector keep an all-zero row
    if embedding_vector is not None:
        EMBEDDINGS_MATRIX[i] = embedding_vector
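Tweets contain many misspellings and slang terms, so it can be useful to check how much of the vocabulary actually receives a pretrained vector. The sketch below repeats the matrix-building loop on toy data and also reports that coverage. The function name, the toy word index and the 3-dimensional toy vectors are hypothetical.

```python
import numpy as np


def build_embedding_matrix(word_index, embeddings, dim):
    """Fill row i with the vector of the word indexed i and report coverage."""
    matrix = np.zeros((len(word_index) + 1, dim))   # row 0 is the padding index
    hits = 0
    for word, i in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[i] = vector
            hits += 1
    coverage = hits / len(word_index)
    return matrix, coverage


# Hypothetical tokenizer index and embeddings, for illustration only
toy_word_index = {"<OOV>": 1, "good": 2, "bad": 3, "lol": 4}
toy_embeddings = {
    "good": np.array([0.1, 0.2, 0.3]),
    "bad": np.array([-0.1, -0.2, -0.3]),
}

toy_matrix, toy_coverage = build_embedding_matrix(toy_word_index, toy_embeddings, dim=3)
print(toy_matrix.shape)
print(toy_coverage)
```

Rows left at zero carry no information for a frozen embedding layer, so a low coverage figure suggests that extra text cleaning or a trainable embedding may be worth considering.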
model = tf.keras.Sequential([
    # Embedding layer arguments:
    #   VOCAB_SIZE+1: size of the vocabulary (plus the padding index 0)
    #   100: dimensionality of the embedding output (GloVe vector size)
    #   input_length=16: length of the padded input sequences
    #   weights: predefined embedding matrix, frozen with trainable=False
    tf.keras.layers.Embedding(VOCAB_SIZE+1, 100, input_length=16,
                              weights=[EMBEDDINGS_MATRIX], trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(train_pad_trunc_seq, train_labels,
                    epochs=20, validation_data=(val_pad_trunc_seq, val_labels))
print(f"test example: {train_pad_trunc_seq[0:1]}")
print(f"test prediction: {model.predict(train_pad_trunc_seq[0:1])}")
Application 4: Text Generation
We will work on generating a new song using TensorFlow, predicting one word at a time. Word-based prediction works well on a small corpus like this one, but it scales poorly to very large bodies of text with many unique words, because each one-hot encoded label then has tens of thousands of elements. An alternative is character-based prediction: the number of unique characters in a corpus is far smaller than the number of unique words, at least in English.
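The difference between the two vocabularies can be checked with a short plain-Python sketch. The helper name is hypothetical, and the sample is a single lower-cased line from the song used below.

```python
def vocabulary_sizes(text):
    """Count unique whitespace-separated words and unique characters in a text."""
    lowered = text.lower()
    words = set(lowered.split())
    characters = set(lowered)      # includes the space character
    return len(words), len(characters)


sample_line = "the harp that once sounded in taras old hall"
n_words, n_chars = vocabulary_sizes(sample_line)
print(n_words, n_chars)
```

On a single line the character set is actually the larger of the two. The point is that it is bounded by the alphabet and punctuation, while the word vocabulary keeps growing as more text is added.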
from tensorflow.keras.preprocessing.text import Tokenizer
# Define the lyrics of the song
data = (
    "In the town of Athy one Jeremy Lanigan \n Battered away til he hadnt a pound. \nHis father died and made him a man again \n Left him a farm and ten acres of ground. \nHe gave a grand party for friends and relations \nWho didnt forget him when come to the wall, \nAnd if youll but listen Ill make your eyes glisten \nOf the rows and the ructions of Lanigans Ball. \nMyself to be sure got free invitation, \nFor all the nice girls and boys I might ask, \nAnd just in a minute both friends and relations \nWere dancing round merry as bees round a cask. \nJudy ODaly, that nice little milliner, \nShe tipped me a wink for to give her a call, \nAnd I soon arrived with Peggy McGilligan \nJust in time for Lanigans Ball. \nThere were lashings of punch and wine for the ladies, \nPotatoes and cakes; there was bacon and tea, \nThere were the Nolans, Dolans, OGradys \nCourting the girls and dancing away. \nSongs they went round as plenty as water, \nThe harp that once sounded in Taras old hall,\nSweet Nelly Gray and The Rat Catchers Daughter,\nAll singing together at Lanigans Ball. \nThey were doing all kinds of nonsensical polkas \nAll round the room in a whirligig. \nJulia and I, we banished their nonsense \nAnd tipped them the twist of a reel and a jig. \nAch mavrone, how the girls got all mad at me \nDanced til youd think the ceiling would fall. \nFor I spent three weeks at Brooks Academy \nLearning new steps for Lanigans Ball. \nThree long weeks I spent up in Dublin, \nThree long weeks to learn nothing at all,\n Three long weeks I spent up in Dublin, \nLearning new steps for Lanigans Ball. \nShe stepped out and I stepped in again, \nI stepped out and she stepped in again, \nShe stepped out and I stepped in again, \nLearning new steps for Lanigans Ball. \nBoys were all merry and the girls they were hearty \nAnd danced all around in couples and groups, \nTil an accident happened, young Terrance McCarthy \nPut his right leg through miss Finnertys hoops. "
    "\nPoor creature fainted and cried Meelia murther, \nCalled for her brothers and gathered them all. \nCarmody swore that hed go no further \nTil he had satisfaction at Lanigans Ball. \nIn the midst of the row miss Kerrigan fainted, \nHer cheeks at the same time as red as a rose. \nSome of the lads declared she was painted, \nShe took a small drop too much, I suppose. \nHer sweetheart, Ned Morgan, so powerful and able, \nWhen he saw his fair colleen stretched out by the wall, \nTore the left leg from under the table \nAnd smashed all the Chaneys at Lanigans Ball. \nBoys, oh boys, twas then there were runctions. \nMyself got a lick from big Phelim McHugh. \nI soon replied to his introduction \nAnd kicked up a terrible hullabaloo. \nOld Casey, the piper, was near being strangled. \nThey squeezed up his pipes, bellows, chanters and all. \nThe girls, in their ribbons, they got all entangled \nAnd that put an end to Lanigans Ball."
)
corpus = data.lower().split("\n")
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
print(f'word index dictionary: {tokenizer.word_index}')
print(f'total words: {total_words}')
Tokenize each line and generate n-gram sequences: every prefix of the line, from its first two tokens up to the full line. Append each n-gram sequence to the input_sequences list.
#For example, if you only have one sentence: "I am using Tensorflow", you want the model to learn the next word given any subphrase of this sentence:
#INPUT LABEL
#-----------------------------
#I ---> am
#I am ---> using
#I am using ---> Tensorflow
from tensorflow.keras.preprocessing.sequence import pad_sequences
input_sequences = []
# Loop over every line
for line in corpus:
    # Tokenize the current line
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Loop over the line several times to generate the subphrases
    for i in range(1, len(token_list)):
        # Generate the subphrase (the first i+1 tokens)
        n_gram_sequence = token_list[:i+1]
        # Append the subphrase to the sequences list
        input_sequences.append(n_gram_sequence)
max_sequence_len = max([len(x) for x in input_sequences])
max_sequence_len
import tensorflow as tf
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
print(f" xs shape: {xs.shape}")
print(F" xs preview example: \n {xs[0:11]}")
print(f" ys shape: {ys.shape}")
print(F" ys preview example: \n{ys[0:11]}")
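The step from padded n-gram sequences to inputs and one-hot labels can be traced by hand on a toy example. The sketch below uses only NumPy. The token ids for "I am using Tensorflow", the vocabulary size of 10 and the helper name are hypothetical, and pre_pad only imitates what pad_sequences does with padding='pre'.

```python
import numpy as np


def pre_pad(sequences, maxlen, pad_value=0):
    """Left-pad each sequence with pad_value so that all rows have length maxlen."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype=int)
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]                      # keep the last maxlen tokens
        padded[row, maxlen - len(seq):] = seq
    return padded


# Hypothetical ids: I=4, am=2, using=7, tensorflow=9
toy_ngrams = [[4, 2], [4, 2, 7], [4, 2, 7, 9]]
toy_padded = pre_pad(toy_ngrams, maxlen=4)

# Everything except the last token is the input; the last token is the label
toy_xs, toy_labels = toy_padded[:, :-1], toy_padded[:, -1]

# One-hot encode the labels over a hypothetical vocabulary of 10 words
toy_ys = np.eye(10)[toy_labels]

print(toy_padded)
print(toy_xs)
print(toy_labels)
```

Padding on the left keeps the word to predict in the final column of every row, which is what allows a single slice to separate inputs from labels.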
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.models import Sequential
# Build the model
model = Sequential([
    Embedding(total_words, 64, input_length=max_sequence_len-1),
    Bidirectional(LSTM(20)),
    Dense(total_words, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
history = model.fit(xs, ys, epochs=500)
# Define seed text
seed_text = "Laurence went to Dublin"
next_words = 100
for _ in range(next_words):
    # Convert the current text to a padded sequence of token ids
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Pick the most probable next word
    probabilities = model.predict(token_list)
    predicted = np.argmax(probabilities, axis=-1)[0]
    # Index 0 is the padding index and has no word attached
    if predicted != 0:
        output_word = tokenizer.index_word[predicted]
        seed_text += " " + output_word
# Print the generated lyrics
print(seed_text)
End note:
We began with sentiment analysis and topic modelling, using nltk and sklearn decomposition techniques. We then used TensorFlow deep learning models for more complex tasks such as text classification and text generation. With the advent of transformer-based large language models (LLMs) such as GPT, NLP applications have demonstrated remarkable performance. LLMs are explored in more depth in another article, “Explore Generative AI and LLM: Unveiling Hugging Face, OpenAI’s GPT, and LangChain”.