Building Blocks for AI Part-1: Vectorization, Similarity Detection, and Sentiment Analysis

Kamalmeet Singh
6 min read · Nov 8, 2023


In my previous post, I discussed the concept of configuring an AI chatbot capable of serving various purposes, ranging from assisting development and support teams to functioning as a self-help tool for customers. The success of such an AI chat system relies heavily on two key elements: data vectorization and similarity detection, which enable the delivery of pertinent information to users.

Vectorization, similarity detection, classification, clustering, and sentiment analysis form the fundamental building blocks of any AI-driven project roadmap.

image generated via OpenAI

In this post, I will cover three of these building blocks, namely vectorization, similarity detection, and sentiment analysis.

Vectorization

As demonstrated in the chatbot creation example, the conversion of text into vectors emerges as a vital component of any AI-related endeavor. Since AI models don’t operate directly on text, it becomes imperative to transform our textual data into a format that AI models can comprehend.

There are several good algorithms available for word-to-vector conversions.

Word2Vec: a popular word embedding algorithm developed by Google, is available in two primary variations: Continuous Bag of Words (CBOW) and Skip-gram. Word2Vec’s primary function is to acquire vector representations for words by either forecasting words based on their surrounding context or predicting the context of words when provided with a target word. Both Gensim and TensorFlow provide implementations for Word2Vec.
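
As a quick illustration, here is a minimal sketch of training a Word2Vec model with Gensim; the toy corpus and hyperparameters below are purely illustrative.

from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (replace with your own tokenized text)
corpus = [
    ["the", "chatbot", "answers", "support", "questions"],
    ["vector", "representations", "capture", "word", "meaning"],
    ["similar", "words", "get", "similar", "vectors"],
]

# Train a small Word2Vec model; sg=0 uses CBOW, sg=1 would use Skip-gram
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0)

# Vector learned for a single word
print(model.wv["chatbot"])

# Words whose vectors are closest to the given word
print(model.wv.most_similar("chatbot", topn=3))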

GloVe: short for “Global Vectors for Word Representation,” is a well-regarded word embedding algorithm that utilizes global word co-occurrence statistics to acquire word vectors. It is renowned for its proficiency in effectively capturing semantic relationships. Pre-trained GloVe embeddings are accessible in multiple dimensions and have been trained on extensive text corpora.
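
One convenient way to experiment with GloVe is to load a pre-trained embedding set through Gensim's downloader, sketched below (the model name assumes the gensim-data download is available).

import gensim.downloader as api

# Download and load 100-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-100")

# Each word maps to a 100-dimensional vector
print(glove["computer"].shape)

# Semantically related words end up close together in the vector space
print(glove.most_similar("computer", topn=3))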

BERT: short for Bidirectional Encoder Representations from Transformers, is a cutting-edge contextual word embedding model rooted in the Transformer architecture. BERT’s proficiency in learning word embeddings is achieved through extensive training on large-scale text data. Its capability to grasp intricate contextual nuances has made it a popular choice for a wide range of Natural Language Processing (NLP) tasks, including sentiment analysis, text classification, and question-answering.
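
For context, here is a small sketch of extracting contextual embeddings from a pre-trained BERT model with the Hugging Face transformers library; mean-pooling the token vectors, as done below, is just one simple way to obtain a single vector for a sentence.

from transformers import AutoTokenizer, AutoModel
import torch

# Load a pre-trained BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "The bank approved the loan."

# Tokenize the text and run it through BERT
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds one contextual vector per token;
# averaging them gives a single vector for the whole sentence
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])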

Sentence Transformers: a specific model type, has been tailored to produce vector representations for sentences, paragraphs, or more extensive textual segments. These embeddings are cultivated through training on extensive text corpora and excel at encapsulating the contextual meaning of the text, whether at the sentence or document level. Sentence Transformers find extensive application in a variety of natural language processing (NLP) tasks, showcasing their utility in activities such as gauging semantic similarity, detecting paraphrases, and performing text classification.

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# List of sentences you want to convert to embeddings
sentences = [
    "This is the first sentence.",
    "Here's the second sentence.",
    "And this is the third sentence."
]

# Compute embeddings for the sentences
sentence_embeddings = model.encode(sentences)

# Print the embeddings for each sentence
for sentence, embedding in zip(sentences, sentence_embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("\n")

Similarity Detection

Another important aspect of AI is detecting similar objects. We have already seen this in the chatbot example, where similarity detection was used to fetch relevant information. There are many other use cases, such as a support person looking for a similar past incident to help a customer, or a salesperson looking for a similar lead to help close a deal.

Here are a few similarity detection algorithms:

Jaccard Similarity: the most basic form of similarity assessment, gauges the likeness between two documents by examining their text content. In essence, Jaccard similarity is a metric for measuring the similarity between two sets. This calculation involves determining the intersection size of the sets and dividing it by the union size of the sets. In text analysis, it is frequently employed to quantify the similarity between collections of words, as seen in document bag-of-words representations.
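
A minimal sketch of Jaccard similarity over word sets (the helper function below is just for illustration):

# Jaccard similarity between two documents, treating each as a set of words
def jaccard_similarity(text_a, text_b):
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# 3 shared words out of 7 unique words => ~0.43
print(jaccard_similarity("the cat sat on the mat", "the dog sat on the log"))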

Euclidean Distance: Euclidean distance computes the straight-line distance between two points in a multi-dimensional space. In the context of text analysis, it can be used to measure the similarity between two documents represented as vectors.
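
A quick sketch with NumPy, using made-up document vectors:

import numpy as np

# Two documents represented as vectors (values are illustrative)
doc_a = np.array([0.2, 0.7, 0.1])
doc_b = np.array([0.3, 0.5, 0.4])

# Straight-line distance between the two points; smaller means more similar
distance = np.linalg.norm(doc_a - doc_b)
print(f"Euclidean distance: {distance:.4f}")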

Cosine Similarity: An alternative to Euclidean distance. Instead of measuring the straight-line distance between two vectors, cosine similarity measures the cosine of the angle between them in the vector space, which makes it insensitive to vector magnitude.
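
The same illustrative vectors can be compared with cosine similarity; a fuller example using sentence embeddings follows later in this section.

import numpy as np

# Two documents represented as vectors (values are illustrative)
doc_a = np.array([0.2, 0.7, 0.1])
doc_b = np.array([0.3, 0.5, 0.4])

# Cosine similarity = dot product divided by the product of the vector magnitudes
cosine = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(f"Cosine similarity: {cosine:.4f}")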

Hamming Distance: Hamming distance is used to measure the difference between two strings of equal length by counting the number of positions at which the corresponding characters differ. It’s often used for comparing binary sequences, such as in error detection or DNA sequence analysis.
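
A minimal sketch (the helper function is illustrative):

# Hamming distance: count the positions at which two equal-length strings differ
def hamming_distance(a, b):
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("10110", "11100"))      # 2
print(hamming_distance("GATTACA", "GACTATA"))  # 2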

TF-IDF (Term Frequency-Inverse Document Frequency) Similarity: TF-IDF is a weighting scheme used to measure the importance of terms in documents. TF-IDF similarity compares the TF-IDF representations of documents to determine their similarity.
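
One way to sketch this is with scikit-learn: documents are converted to TF-IDF weighted vectors and then compared, here using cosine similarity over those vectors (the sample documents are illustrative).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The support team resolved the incident quickly.",
    "The incident was resolved by the support team.",
    "Sales closed a new deal with the customer.",
]

# Convert each document to a TF-IDF weighted vector
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Compare documents by the cosine similarity of their TF-IDF vectors;
# the first two documents score much higher against each other than against the third
print(cosine_similarity(tfidf_matrix))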

Below, we extend our previous Sentence Transformer example to calculate text similarity using cosine similarity.

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# List of sentences you want to convert to embeddings
sentences = [
    "This is the first sentence.",
    "Here's the second sentence.",
    "And this is the third sentence."
]

# Compute embeddings for the sentences
sentence_embeddings = model.encode(sentences)

# Calculate cosine similarity between sentence pairs
similarities = util.pytorch_cos_sim(sentence_embeddings, sentence_embeddings)

# Print similarities between sentences
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            similarity = similarities[i][j]
            print(f"Similarity between Sentence {i+1} and Sentence {j+1}: {similarity:.4f}")

Sentiment Analysis

Sentiment analysis, sometimes referred to as opinion mining, falls within the domain of natural language processing (NLP) and revolves around the task of discerning the sentiment or emotional tone conveyed within the text, such as reviews, tweets, or customer feedback. This process can be accomplished through machine learning (ML) approaches, including the utilization of pre-trained models that have already been trained on data tagged with positive or negative sentiments. For instance, terms like “Loved,” “Liked,” and “Helpful” can be associated with positive sentiments.

Here is an example using a pre-trained DistilBERT model fine-tuned for sentiment analysis.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the DistilBERT tokenizer and the fine-tuned sentiment analysis model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Input text for sentiment analysis
text = "It was nice meeting you."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Perform inference
outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map the predicted class back to a sentiment label
sentiment_labels = ["Negative", "Positive"]
predicted_sentiment = sentiment_labels[predicted_class]

print(f"Predicted Sentiment: {predicted_sentiment}")

In this code, we load the distilbert-base-uncased-finetuned-sst-2-english model for sentiment analysis. It has been fine-tuned on the SST-2 (Stanford Sentiment Treebank) dataset for English sentiment classification, so it can be used directly to classify the sentiment of the input text. Make sure to install the transformers library and PyTorch before running this code.

Conclusion

When establishing an Artificial Intelligence and Machine Learning roadmap for your projects, it is crucial to treat vectorization, similarity detection, and sentiment analysis as three foundational components. Proper vectorization techniques are paramount, since they underpin the majority of AI/ML algorithms. Similarity detection enables the identification of similar documents and entities within the system, aiding in the retrieval of relevant information and informed decision-making. Lastly, sentiment analysis helps companies assess whether the feedback or comments they receive convey positive or negative sentiment.

All the code examples shared in this post are available in the following code repository.
