Stories by Kavach Dheer on Medium

Detecting Pre-Training Data Leakage with Membership Inference

Kavach Dheer — Sat, 07 Jun 2025 18:33:20 GMT

How to tell if your target LLM was secretly trained on your test set

Introduction

In evaluating large language models (LLMs), a critical concern is whether a model has been inadvertently exposed to the very data we use for testing or benchmarking. If a model has memorized specific examples from its pre-training corpus, its test-time performance may overstate its true generalization ability. In this post, we’ll walk through a lightweight membership inference technique that flags whether a “target” LLM has seen a given dataset during pre-training, by comparing it against a “reference” model guaranteed not to have seen that data.

We’ll demonstrate this on the Amazon Review — Luxury Beauty dataset, using:

Target model: google/gemma-7b
Reference model: EleutherAI/pythia-6.9b-deduped (trained only on the Pile, which contains no Amazon reviews)

Because our reference model has never seen the Luxury Beauty reviews, any unusually high confidence that the target model shows — relative to the reference — signals possible memorization.

1. Core Idea: Likelihood Differential

At the heart of our approach is a simple premise:

If the target model was trained on a review text then it will assign that text a significantly higher probability (i.e., lower surprisal) than a model that never saw it.

We operationalize this with a normalized log-likelihood metric:

For each review text x, compute the sum of token log-probabilities under each model M (target or reference):

$$\log p_M(x) = \sum_{t} \log p_M(x_t \mid x_{<t})$$

Measure each review’s “information content” by compressing x with zlib and taking the compressed length z(x) in bytes, which serves as a rough proxy for its entropy.

Define a normalized score per model:

$$s_M(x) = \frac{\log p_M(x)}{z(x)}$$

The membership signal is the difference:

$$\Delta(x) = s_{\text{tgt}}(x) - s_{\text{ref}}(x)$$

A large positive Δ suggests that the target model “knows” x much better than an unexposed model.
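The steps above can be sketched in a few lines of standard-library Python. This is only an illustration of the arithmetic: the function names and the two log-probability values are assumptions made up for the example, not outputs of any real model.

```python
import zlib

def normalized_score(sum_logprob: float, text: str) -> float:
    """Sum of token log-probabilities divided by the zlib-compressed length."""
    z = len(zlib.compress(text.encode("utf-8")))
    return sum_logprob / z

def delta(sum_logprob_target: float, sum_logprob_ref: float, text: str) -> float:
    """Membership signal: target score minus reference score."""
    return normalized_score(sum_logprob_target, text) - normalized_score(sum_logprob_ref, text)

review = "This moisturizer is light, absorbs quickly, and smells great."
# Hypothetical summed log-probabilities; not real model outputs.
example_delta = delta(-35.0, -50.0, review)
print(example_delta > 0)  # the target is less "surprised" than the reference
```

Because log-probabilities are negative, a target model that assigns the text a higher probability has a score closer to zero, which makes the difference positive.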

File One: Computing and Saving Δ Scores

Below is a Python script that scans through your gzipped JSON reviews, computes Δ for each one, and writes the results to a timestamped .txt file.

#!/usr/bin/env python3
import argparse, gzip, json, zlib, os
from datetime import datetime
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from tqdm import tqdm

def compute_sum_logprob(model, tokenizer, text, device, max_length=2048):
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=max_length).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # shift for next-token log-probabilities
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = inputs["input_ids"][..., 1:].contiguous()
    log_probs = F.log_softmax(shift_logits, dim=-1)
    token_log_probs = log_probs.gather(-1, shift_labels.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

def main():
    parser = argparse.ArgumentParser(
        description="Membership inference via Δ = logp / zlib_len"
    )
    parser.add_argument("--data",   type=str, default="Luxury_Beauty.json.gz")
    parser.add_argument("--limit",  type=int, default=0)
    parser.add_argument("--output", type=str, default="membership_results.csv")
    args = parser.parse_args()

    device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

    # --- Load target and reference models ---
    target_name = "google/gemma-7b"
    ref_name    = "EleutherAI/pythia-6.9b-deduped"
    quant_cfg = BitsAndBytesConfig(load_in_4bit=True,
                                   bnb_4bit_compute_dtype=torch.bfloat16)

    print(f"Loading target model ({target_name})…")
    tgt_tok = AutoTokenizer.from_pretrained(target_name)
    tgt_mdl = AutoModelForCausalLM.from_pretrained(
        target_name, quantization_config=quant_cfg, device_map={"": 1})

    print(f"Loading reference model ({ref_name})…")
    ref_tok = AutoTokenizer.from_pretrained(ref_name)
    ref_mdl = AutoModelForCausalLM.from_pretrained(
        ref_name,  quantization_config=quant_cfg, device_map={"": 1})

    # --- Read dataset ---
    print(f"Reading records from {args.data}…")
    with gzip.open(args.data, "rt", encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    if args.limit > 0:
        records = records[:args.limit]
    print(f"→ Will process {len(records)} reviews")

    # --- Compute Δ for each review ---
    results = []
    for rec in tqdm(records, desc="Processing reviews"):
        text = rec.get("reviewText", "").strip()
        if not text:
            continue

        zlen = len(zlib.compress(text.encode("utf-8")))
        sum_lp_tgt = compute_sum_logprob(tgt_mdl, tgt_tok, text, device)
        sum_lp_ref = compute_sum_logprob(ref_mdl, ref_tok, text, device)

        delta_tgt = sum_lp_tgt / zlen
        delta_ref = sum_lp_ref / zlen
        delta_diff = delta_tgt - delta_ref

        results.append({
            "reviewID":   rec.get("reviewerID"),
            "zlib_len":   zlen,
            "sum_logp_tgt": sum_lp_tgt,
            "sum_logp_ref": sum_lp_ref,
            "delta_tgt":  delta_tgt,
            "delta_ref":  delta_ref,
            "delta_diff": delta_diff,
        })

    # --- Save results with timestamp ---
    output_dir = '/home/kavach_d/.../results'
    os.makedirs(output_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")  # no colons: they are not allowed in Windows file names
    filename  = f"{timestamp}.txt"
    filepath  = os.path.join(output_dir, filename)
    with open(filepath, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Wrote {len(results)} items to {filepath}")

if __name__ == "__main__":
    main()

Key Points

Normalization: Dividing the log-probability by the zlib-compressed length makes scores comparable across reviews of different lengths and redundancy.
Reference model requirement: It is essential that the reference model never saw your dataset during pre-training. Here, we use EleutherAI’s Pythia-6.9B-deduped, trained solely on the Pile (which contains zero Amazon reviews), ensuring a clean baseline.
4-bit quantization: Loading both models in 4-bit via BitsAndBytes keeps memory use low.

Thresholding for Membership

Once you’ve computed Δ for each review, you can simply flag those with Δ above a chosen threshold as “members.”

import json
import pandas as pd

# 1. Load the JSON results
json_path = '.../results/gemma.txt'  # replace with your path
with open(json_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# 2. Create a DataFrame
df = pd.DataFrame(data)

# 3. Set your threshold
threshold = 0.01

# 4. Flag membership
df['is_member'] = df['delta_diff'] >= threshold

# 5. Print a summary with percentages
total       = len(df)
num_members = df['is_member'].sum()
pct_members = num_members / total * 100
pct_non     = 100 - pct_members

print(f"Threshold set at: {threshold}")
print(f"Total samples:      {total}")
print(f"Flagged as member:  {num_members} ({pct_members:.2f}%)")
print(f"Flagged as non-member: {total - num_members} ({pct_non:.2f}%)")

The threshold (0.01 here) can be tuned to your desired false-positive/false-negative trade-off.
Try other values, such as 0.05 and 0.001, and see how the flagged fraction changes.
You could also plot the distribution of Δ to help choose a threshold.
The final summary tells you what fraction of reviews appear memorized by the target model.
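To see how the flagged fraction moves with the threshold, a small sweep like the following may help. It is a sketch in plain Python; the `sample_deltas` values are invented for illustration, and in practice you would pass in the `delta_diff` column from your results.

```python
def flagged_fraction(deltas, threshold):
    """Fraction of samples whose score difference is at or above the threshold."""
    if not deltas:
        return 0.0
    return sum(d >= threshold for d in deltas) / len(deltas)

# Hypothetical delta_diff values, for illustration only.
sample_deltas = [-0.02, 0.0005, 0.004, 0.012, 0.03, 0.07]

sweep = {t: flagged_fraction(sample_deltas, t) for t in (0.001, 0.01, 0.05)}
for t, frac in sweep.items():
    print(f"threshold={t}: {frac:.1%} flagged")
```

A stricter (higher) threshold can only flag the same number of samples or fewer, so the fraction is non-increasing as the threshold grows.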

Conclusion

Membership inference based on likelihood differentials offers a straightforward, interpretable way to audit LLM pre-training. By comparing a suspect target model against a carefully chosen reference that cannot have seen the dataset, we obtain a clear signal of memorization. This methodology helps ensure robust, honest evaluations of model generalization — vital in both research and real-world deployments.

How to Fine-Tune Language Models with TensorFlow and Hugging Face

Kavach Dheer — Sat, 07 Jun 2025 17:54:15 GMT

How to Fine-Tune Language Models

In this article, we’ll walk through fine-tuning BERT on the MRPC (Microsoft Research Paraphrase Corpus) task using TensorFlow and Hugging Face’s Transformers library.

Why Fine-Tune?

Fine-tuning adjusts the pretrained model’s parameters on task-specific data, so the model adapts to your task instead of relying only on general-purpose language knowledge.

1. Setup and Imports

First, install the necessary libraries:

pip install transformers datasets tensorflow scikit-learn

Then, import the Python modules you’ll need:

import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from sklearn.metrics import accuracy_score, f1_score
from datasets import load_dataset

2. Load the Dataset

We’ll use the GLUE benchmark’s MRPC split, which contains pairs of sentences labeled as paraphrase or not.

raw_dataset = load_dataset('glue', 'mrpc')

3. Tokenization

BERT requires token IDs, attention masks, and token type IDs. We’ll use the bert-base-uncased checkpoint’s tokenizer to process both sentences in each example.

By using .map(), each batch of raw examples is passed through the tokenizer and the resulting token IDs, attention masks, and token type IDs are appended directly to the dataset without duplicating all other fields.

checkpoint = 'bert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(
        example['sentence1'],
        example['sentence2'],
        truncation=True
    )

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

4. Padding and Batching

To feed data into TensorFlow, we need uniform sequence lengths per batch. The DataCollatorWithPadding will pad on the fly.

Using a collator means padding is applied per batch at runtime, rather than padding every example in advance to the same length. This dynamic padding further reduces memory overhead by ensuring sequences are only as long as the longest example in each batch.
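The saving from dynamic padding can be made concrete with a little arithmetic. The sketch below is plain Python and does not use the collator itself; the function names and the token counts are assumptions chosen for illustration.

```python
def padded_tokens_global(lengths):
    """Total tokens if every sequence is padded to the longest in the dataset."""
    return max(lengths) * len(lengths)

def padded_tokens_dynamic(lengths, batch_size):
    """Total tokens if each batch is padded only to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

# Hypothetical token counts for eight examples.
lengths = [12, 15, 9, 40, 11, 13, 10, 14]
global_total = padded_tokens_global(lengths)
dynamic_total = padded_tokens_dynamic(lengths, batch_size=4)
print(global_total, dynamic_total)  # 320 216
```

Here a single long example (40 tokens) forces every sequence to length 40 under global padding, while with per-batch padding only its own batch pays that cost.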

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors='tf')

tf_train_dataset = tokenized_dataset['train'].to_tf_dataset(
    columns=['input_ids', 'attention_mask', 'token_type_ids'],
    label_cols=['label'],
    shuffle=True,   # shuffle training examples each epoch
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_dataset['validation'].to_tf_dataset(
    columns=['input_ids', 'attention_mask', 'token_type_ids'],
    label_cols=['label'],
    shuffle=False,  # keep order so predictions line up with the reference labels
    collate_fn=data_collator,
    batch_size=8,
)

5. Model Initialization

We load TFAutoModelForSequenceClassification with two output labels (paraphrase vs. non-paraphrase).

model = TFAutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2
)

6. Compile and Train

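Choosing the right optimizer and learning rate is crucial. Keras’s default Adam learning rate (1e-3) is usually too high for fine-tuning BERT, so we pass a small learning rate explicitly and use a sparse categorical cross-entropy loss (since labels are integer IDs).

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)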

model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=5,
)

Tip: Start with a small learning rate (e.g., 2e-5) and consider using learning rate schedules or warm-up — this often improves stability and final performance.
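As a rough illustration of what a warm-up schedule does, here is a plain-Python sketch of linear warm-up followed by linear decay. The function name, step counts, and peak value are assumptions for the example; in a real training run you would express the same shape through your framework’s learning rate schedule mechanism.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at a step: linear warm-up to peak_lr, then linear decay to 0.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Hypothetical run: 1000 training steps, the first 100 used for warm-up.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps=1000, warmup_steps=100))
```

The rate starts at zero, reaches its peak at the end of warm-up, and falls back to zero at the final step, which avoids large, destabilizing updates to the pretrained weights early in training.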

7. Evaluation

After training, we evaluate on the validation set:

# 1. Get raw logits
logits = model.predict(tf_validation_dataset).logits

# 2. Convert to predicted class IDs
class_preds = np.argmax(logits, axis=-1)

# 3. Compute metrics
refs = raw_dataset["validation"]["label"]
print({
    "accuracy": accuracy_score(refs, class_preds),
    "f1":       f1_score(refs, class_preds)
})

Conclusion

Fine-tuning lets you leverage cutting-edge models with minimal coding. With just a few lines of code, you can adapt BERT (or any Transformer) to your task, unlocking powerful language understanding. Experiment with hyperparameters, regularization, and training strategies to push performance even further.

Happy fine-tuning!

The Birth and Evolution of CNNs: From ImageNet to Human-Level Performance

Kavach Dheer — Sun, 02 Feb 2025 19:47:13 GMT

In this article, I will explain the foundations of CNNs: the story behind them, the model that made them famous, the people involved, and how they transformed the world of deep learning.

Part 1: Vision and Creation of the Dataset

So the story starts in 2006. At that time, most researchers were focused on the models and algorithms of deep learning. Fei-Fei Li took a different path and decided to build a dataset, because deep learning needs large amounts of data to perform well. While everybody else was trying to optimize and create algorithms, almost nobody was working on the data needed to train and test them. Fei-Fei Li saw the larger picture: even with good models and algorithms, we would still need data. So she started work on a dataset called ImageNet.

It took her 2.5 years to create the dataset, and in 2009, she and her team completed it.

The creation of this dataset was phenomenal: it was one of the largest image datasets of its time. To build it, she and her team had to overcome several challenges.

Scale: It contains over 14 million images that have been hand-annotated to indicate what objects are pictured. The database has more than 20,000 categories or “synsets.”
Creation Process: The images were collected from the web and labeled by human workers through Amazon’s Mechanical Turk crowdsourcing platform. Each image was labeled by multiple workers to ensure accuracy.
Organization: The database is organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset.”

Part 2: Vision of ILSVRC

I strongly believe humans are competitive by nature — they tend to break their limitations when competing with each other. Olympics, racing, and sports are prime examples. Even in daily life, humans constantly compete, whether lifting heavier weights than friends at the gym or striving to outperform colleagues at work. The examples are endless.

Similarly, from 2006 to 2009, Netflix held a competition in machine learning. They offered a $1 million prize to anyone who could improve their recommendation system’s accuracy by 10%. The winning solution was an ensemble in which matrix factorization techniques played a central role, and matrix factorization became a foundational approach in recommendation systems. It was this competitive spirit that transformed the field of recommender systems.

Similarly, Fei-Fei Li thought of creating a competition of this kind: we now had the dataset, but we still needed models that could be trained and tested on it.

So Fei-Fei Li and her team launched the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2010. ILSVRC was an annual competition that used a subset of ImageNet data to test and compare computer vision algorithms. The challenge focused on:

Image Classification: Algorithms had to classify images into 1000 different object categories
Object Detection: Models needed to identify and locate specific objects within images
Object Localization: Participants had to precisely indicate where objects appeared in images using bounding boxes.

Part 3: Entry of Geoffrey Hinton and the Disruption of CNNs

In the first two years of the competition, researchers used traditional machine learning models, achieving error rates of about 28% in 2010 and about 26% in 2011.

The breakthrough came in 2012 when Geoffrey Hinton’s group entered the competition. (If you don’t know who he is, you might want to brush up on your deep learning history!)

Together with his students Alex Krizhevsky and Ilya Sutskever, Hinton introduced a revolutionary architecture called AlexNet (image of architecture attached at bottom), which was groundbreaking for three reasons: it was trained on GPUs, it used ReLU as the activation function, and it showed that a deep CNN could win at ImageNet scale.

The team dramatically reduced the error rate to about 16%, marking the moment when CNNs took center stage. (CNNs themselves date back to earlier work such as Yann LeCun’s LeNet.) Everyone’s attention turned to CNNs, amazed by how this model significantly outperformed traditional machine learning approaches.

From that point forward, there was no looking back. CNNs revolutionized the field of deep learning, fulfilling Fei-Fei Li’s vision and validating her hard work.

The competition continued, with new CNN architectures consistently lowering the error rate:

2013: ZFNet, 11.7%

2014: VGG, 7.3%

2014: GoogLeNet, 6.7%

In 2015, ResNet achieved a remarkable error rate of about 3.6%, surpassing the commonly cited human-level performance of about 5%. The competition itself ran for the last time in 2017.

If you liked this article, give it a like; if not, tell me in the comments how I can improve.

Figure: AlexNet architecture

Multithreading

Kavach Dheer — Mon, 27 Jan 2025 12:42:00 GMT

A one-stop guide to multithreading

Definitions

1. Processes

A process represents a program in execution. It is an independent entity that has its own memory space (stack, heap, etc.) and system resources.
A process is managed by the operating system, and it runs sequentially unless it’s specifically designed to utilize multiple threads.

2. Threads

Threads are often called “lightweight processes” because they share the memory space and resources of their parent process.
Each thread has its own execution stack and program counter, but it operates within the context of a process. This shared memory allows for easier data exchange between threads, but it also introduces potential challenges, like race conditions.
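A race condition on shared memory, and one common way to avoid it, can be sketched with Python’s `threading` module. The counter, the worker function, and the iteration counts below are illustrative assumptions, not part of any particular program.

```python
import threading

counter = 0                 # shared by all threads in this process
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with lock:          # only one thread at a time may update the counter
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Without the lock, two threads could read the same old value and both write back the same new value, silently losing an update. The lock makes each read-modify-write step exclusive.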

3. Multithreading

Multithreading occurs when multiple threads are running within a single process. These threads can execute concurrently, depending on the CPU and the threading model.
The main advantage of multithreading is the ability to work on several tasks at once, improving efficiency and responsiveness in programs. For instance, one thread might handle user input while another processes data or communicates with a server.

Typical Workflow

1. A process creates the environment:

A process is the container that holds everything needed for a program to run, such as the code, data, and system resources (like memory, file handles, etc.).
Think of a process as the “house” that provides the space and utilities for threads to “live and work.”

2. Threads work within the process:

Threads are the “workers” inside the process. They share the process’s resources (like memory, files, and variables) but have their own execution context (like program counter, stack, and registers).
Each thread can execute independently, but since they share the same memory, they can easily communicate with each other.

3. Multithreading:

When a process has multiple threads, those threads can run concurrently (or in parallel on multi-core CPUs).
For example, in a web browser (a process), one thread might render the webpage, another might handle user input, and a third might manage background network requests — all happening in the same process.
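The “house and workers” picture can be checked directly: threads started by one process report the same process ID and can write to the same in-memory list. The sketch below uses Python’s `threading` module; the task names mirror the browser example and are purely illustrative.

```python
import os
import threading

results = []                      # lives in the process's memory, visible to every thread
results_lock = threading.Lock()

def worker():
    """Record which thread ran and which process it belongs to."""
    with results_lock:
        results.append((threading.current_thread().name, os.getpid()))

tasks = ["render", "input", "network"]
workers = [threading.Thread(target=worker, name=task) for task in tasks]
for w in workers:
    w.start()
for w in workers:
    w.join()

pids = {pid for _, pid in results}
print(sorted(name for name, _ in results), len(pids))
```

All three workers append to the same list and report a single process ID, which is the shared “house” they work in.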

Parallel Computing

1. Parallel Computing at the Process Level

On a multi-core CPU, multiple processes can run in parallel at the same time.

For example:

Core 1: Running Process A
Core 2: Running Process B
Core 3: Running Process C
Core 4: Running Process D
Each process runs independently and may or may not have threads inside it.

2. Parallel Computing at the Thread Level (Multithreading)

Inside a single process, multithreading allows different threads to run concurrently or in parallel.
On a multi-core CPU, threads from the same process can run in parallel on separate cores.

For example:

Process A (with 4 threads):
Core 1: Thread 1 of Process A
Core 2: Thread 2 of Process A
Core 3: Thread 3 of Process A
Core 4: Thread 4 of Process A
On a single-core CPU, multithreading does not achieve true parallelism but instead uses time slicing to switch between threads very quickly, creating the illusion of parallelism (this is called concurrency).
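Here is a minimal sketch of threads overlapping their work inside one process, using Python’s `concurrent.futures`. The `fake_request` function is a hypothetical stand-in for an I/O-bound task. Sleeping tasks are used on purpose: in the standard CPython build, the global interpreter lock makes CPU-bound pure-Python threads take turns rather than run in parallel, so the overlap is easiest to see with waiting tasks.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    """Stand-in for an I/O-bound task such as a network call."""
    time.sleep(0.2)
    return i * i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(fake_request, range(4)))
elapsed = time.perf_counter() - start

print(squares, f"{elapsed:.2f}s")
```

Run one after another, the four tasks would take about 0.8 seconds. With four threads they wait at the same time, so the total is close to the duration of a single task.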

3. Can Parallel Computing Be Achieved with Just One Process?

Yes, a single process can achieve parallel computing using multithreading, but only if:

The CPU has multiple cores to run threads simultaneously.
The threads are properly written to divide the workload across those cores.
If the CPU has only one core, the threads will run one after another (time slicing), and it won’t be true parallel computing, but rather concurrency.

Key Takeaway

To achieve parallel computing:

Multiple processes can run in parallel on multiple CPU cores.
Inside a single process, multithreading allows parallelism if the CPU supports it (multi-core).

Overview of Sentiment Analysis with Large Language Models

Kavach Dheer — Fri, 27 Sep 2024 12:44:59 GMT

This article gives an overview of sentiment analysis, covering its definition and types. We’ll start by examining what sentiment analysis is and its various types. Finally, we’ll delve into sentiment analysis using large language models.

Introduction

Sentiment analysis is also called opinion analysis or opinion mining. Several real-world applications rely on it for detailed investigation. In product analysis, for example, it helps discover which components or qualities of a product appeal to customers.

Sentiment analysis is used for applications such as reputation management, market research, competitor analysis, product analysis, and customer voice. Several issues complicate sentiment analysis and natural language processing, such as informal writing style, sarcasm, irony, and language-specific challenges. Many words change their meaning and orientation depending on the context and domain in which they are used, and tools and resources are not available for every language. Sarcasm and irony are two of the most critical challenges and have recently attracted the attention of researchers, with much progress in detecting them in text.

Types of Sentiment Analysis

Sentiment analysis has been investigated at several levels: document level, sentence level, phrase level, and aspect level, as shown in Fig. 1.

Document level sentiment analysis

Document-level: Document-level sentiment analysis is performed on a whole document, and a single polarity is assigned to it. This type of sentiment analysis is not widely used. It can be used to classify chapters or pages of a book as positive, negative, or neutral. At this level, both supervised and unsupervised learning approaches can be used to classify the document.

Sentence level sentiment analysis

Sentence level: At this level of analysis, each sentence is analyzed and assigned a corresponding polarity. This is highly useful when a document has a wide range and mix of sentiments associated with it (Yang and Cardie 2014). This classification level is associated with subjectivity classification (Rao et al. 2018). The polarity of each sentence is determined independently using the same methodologies as at the document level, but with more training data and processing resources. The polarity of each sentence may be aggregated to find the sentiment of the document or used individually.

Phrase level sentiment analysis

Phrase level: Sentiment analysis can also be performed at the phrase level, where opinion words are mined from each phrase and then classified. Each phrase may contain a single aspect or multiple aspects. This is useful for multi-line product reviews, where a single aspect is often expressed in a phrase.

Aspect level sentiment analysis

Aspect-level sentiment analysis takes words (and phrases) from text to identify specific aspects or features being discussed, and then it determines the sentiment (positive, negative, or neutral) associated with those aspects.

How it works:

Identify the aspects (features) from the text:

For example, in a sentence like “The screen is bright, but the sound quality is poor,” the words “screen” and “sound quality” are recognized as aspects.

Determine the sentiment for each aspect:

The sentiment associated with “screen” is positive (because of the word “bright”).

The sentiment associated with “sound quality” is negative (because of the word “poor”).

In this way, aspect-level sentiment analysis goes beyond just detecting the overall tone of the text and identifies opinions related to specific words or aspects mentioned.
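The two steps above can be sketched with a toy, lexicon-based approach. This is only an illustration: the `ASPECTS` and `OPINIONS` lexicons and the clause-splitting rule are assumptions made up for the example, and real aspect-level systems typically learn these from data.

```python
import re

# Hypothetical toy lexicons, for illustration only.
ASPECTS = ["sound quality", "screen", "battery"]
OPINIONS = {"bright": "positive", "great": "positive", "poor": "negative", "slow": "negative"}

def aspect_sentiments(text):
    """Pair each known aspect with the first opinion word found in its clause."""
    found = {}
    # Split into clauses on commas and on the conjunction "but".
    for clause in re.split(r",|\bbut\b", text.lower()):
        aspect = next((a for a in ASPECTS if a in clause), None)
        words = re.findall(r"[a-z]+", clause)
        opinion = next((OPINIONS[w] for w in words if w in OPINIONS), None)
        if aspect and opinion:
            found[aspect] = opinion
    return found

print(aspect_sentiments("The screen is bright, but the sound quality is poor."))
# {'screen': 'positive', 'sound quality': 'negative'}
```

Splitting into clauses first keeps “bright” attached to the screen and “poor” attached to the sound quality, instead of averaging them into a single overall tone.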

Feature Selection

Developing a classification model first requires identifying the relevant features in the dataset.

Emojis depict facial expressions and other symbols that convey emotion, which makes them useful sentiment signals.
Punctuation marks, such as exclamation marks, serve to highlight the force of a positive or negative remark. The apostrophe and the question mark are other examples.
Slang words, such as lol and rofl, are frequently used to introduce a sense of humor into a remark.

Feature Extraction

Feature extraction is a key task in sentiment classification as it involves the extraction of valuable information from the text data, and it will directly impact the performance of the model.

Negations: These are words that can reverse the polarity of an opinion and shift the meaning of a sentence. Commonly used negation words include not, cannot, neither, never, nowhere, none, etc. For instance, “This movie is good.” is a positive sentence, but “The movie is not good.” is a negative one. Regrettably, some systems eliminate negation words because they are included in stop-word lists, or implicitly omit them because they have a neutral sentiment value in a lexicon. At the same time, not every occurrence of a negation word reverses the polarity, so treating all of them alike may increase the computational cost and decrease the model’s accuracy. Reversing the polarity is therefore not straightforward, and negation words must be handled with the utmost care (George et al. 2013).
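A minimal sketch of negation handling in a lexicon-based scorer is shown below. The `POLARITY` lexicon, the `NEGATIONS` list, and the two-token window are assumptions chosen for illustration; they are one simple heuristic, not a standard method.

```python
# Hypothetical toy lexicon and negation list, for illustration only.
POLARITY = {"good": 1, "great": 1, "bad": -1, "poor": -1}
NEGATIONS = {"not", "never", "neither", "nor", "cannot"}

def sentence_polarity(sentence, window=2):
    """Sum word polarities, flipping a word's sign if a negation
    occurs within `window` tokens before it."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POLARITY:
            negated = any(t in NEGATIONS for t in tokens[max(0, i - window):i])
            score += -POLARITY[tok] if negated else POLARITY[tok]
    return score

print(sentence_polarity("This movie is good."))      # 1
print(sentence_polarity("The movie is not good."))   # -1
print(sentence_polarity("Not only the cast but the plot is good."))  # 1
```

The third sentence contains “not” far from “good”, so the window leaves the polarity unchanged. This mirrors the point that a negation word can appear without reversing the sentiment.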

Bag of Words (BoW): BoW is one of the simplest approaches for extracting text features. It describes the occurrence of words in a document: the “bag” is the vocabulary of words, from which a vector is formed for each sentence. The main problem with this model is that it ignores word order and syntax. For instance, consider two sentences, s1 = “the food was good” and s2 = “the service was bad”. The vocabulary built from the two sentences is v = {‘the’, ‘food’, ‘was’, ‘service’, ‘bad’, ‘good’}, so each vector has length 6: v1 = [1 1 1 0 0 1] and v2 = [1 0 1 1 1 0]. Raw BoW counts are often reweighted with TF-IDF, which performs better in most cases.
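The BoW example can be reproduced with a few lines of plain Python. The function name is an assumption for this sketch, and the tokenizer is a simple whitespace split.

```python
def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]

vocabulary = ["the", "food", "was", "service", "bad", "good"]
v1 = bag_of_words("the food was good", vocabulary)
v2 = bag_of_words("the service was bad", vocabulary)
print(v1)  # [1, 1, 1, 0, 0, 1]
print(v2)  # [1, 0, 1, 1, 1, 0]
```

Shuffling the words of either sentence gives exactly the same vector, which is the loss of word order described above.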

Word Embedding

Word embeddings represent words in a vector space in which words with similar meanings lie close together. Each word from a predetermined vocabulary is assigned a vector, and the vectors are learned in a manner similar to the weights of a neural network. The dimension of the vectors is chosen as a hyperparameter. The skip-gram (SG) model and the continuous bag-of-words (CBOW) model are two of the most well-known algorithms for word embeddings. Both are shallow window-based methods in which a short window of some size, such as four or six, is specified: in CBOW the current word is predicted from its context words, while in SG the context words are predicted from the current word. Word embeddings thus learn about words from their local usage, which is defined by a window of nearby terms.
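The difference between the two window-based setups is easiest to see in the training examples they produce. The sketch below only builds those (input, target) pairs in plain Python; it does not train any embeddings, and the function names are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the center word is the input, each context word is a target."""
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]

tokens = "the food was good".split()
print(cbow_examples(tokens, window=1)[1])    # (['the', 'was'], 'food')
print(skipgram_examples(tokens, window=1))
```

CBOW yields one example per position with several input words, while skip-gram yields one example per (center, context) pair, so it produces more training examples from the same text.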

Word2vec: Word2vec is a two-layer neural network used for vectorizing tokens. It is one of the most famous and widely used vectorizing techniques, developed by Mikolov et al. (2013). Word2vec has two main models, CBOW and SG. The CBOW model predicts the target word using context words, whereas the SG model predicts the context words using the target word. With a larger dataset, the SG model performs better.

Global Vectors (GloVe): Global Vectors for word representation were developed by Pennington et al. (2014) as an unsupervised learning approach that generates word embeddings from a corpus’s word-to-word co-occurrence matrix. GloVe is popular because the model is straightforward and quick to train, thanks to its parallel implementation capacity (Al Amrani et al. 2018).

fastText: fastText is a free, open-source library developed by FAIR (Facebook AI Research), mainly used for text classification, vectorization, and the creation of word embeddings. It uses a linear classifier, which makes training very fast (Bojanowski et al. 2017). It supports both CBOW and SG models, and semantic similarities can be found using its embeddings.

ELMo: ELMo is a deep contextualized text representation. It helps overcome the limitations of conventional text representation approaches such as LSA, TF-IDF, and n-gram models (Peng et al. 2019). ELMo generates embeddings for words based on the contexts in which they are used, capturing word meaning along with additional contextual information. Through pretraining, ELMo can more accurately represent polysemous words in a variety of contexts and is more informative about the text’s higher-level semantics (Ling et al. 2020).

Tasks of Sentiment Analysis

An overview of the various tasks of sentiment analysis is shown in Figure 2.

Subjectivity Classification is a Natural Language Processing (NLP) task that aims to classify a piece of text as either subjective or objective. The goal is to determine whether a text expresses personal opinions, feelings, and beliefs (subjective) or factual, neutral information (objective). Key Concepts:
Subjective text: Subjective text expresses personal opinions, judgments, emotions, or viewpoints. It is non-factual and influenced by individual perceptions or feelings. Example: “The camera quality is amazing” is subjective because it conveys an opinion or personal experience.
Objective text: Objective text provides factual information, usually neutral, verifiable, and not influenced by personal feelings or opinions. Example: “The camera has a 12-megapixel sensor” is objective because it presents a measurable fact.
Sentiment Classification is a Natural Language Processing (NLP) task that involves determining the emotional tone or sentiment expressed in a piece of text. It classifies the text into predefined categories of sentiment, such as positive, negative, or neutral. This classification is often used in applications like product reviews, social media analysis, and customer feedback to understand users’ opinions or emotions about a particular subject.
Opinion Spam Detection is the process of identifying and filtering out deceptive, fake, or manipulative opinions (e.g., product reviews, ratings, comments) that are intended to mislead potential consumers or distort the reputation of a product, service, or organization. These “spam” opinions are crafted to either promote or demote an item artificially and are a growing issue, especially in platforms like Amazon, Yelp, TripAdvisor, and social media.
Implicit Language Detection: Sarcasm, irony, and humor are generally referred to as Implicit Languages.
Aspect Extraction is a key task in Natural Language Processing (NLP) that focuses on identifying specific components, features, or attributes of a product or service mentioned in text. This process is crucial for understanding opinions expressed in reviews, feedback, or discussions, particularly in the context of aspect-based sentiment analysis. Key Concepts:
Aspects: Aspects are specific features or attributes of a product, service, or entity that users express opinions about. For example, in a restaurant review, aspects could include food quality, service, ambiance, and price. In a smartphone review, aspects might be battery life, camera quality, screen resolution, etc.
Example:
Review: “The food at the restaurant was delicious, but the service was slow.”
Identified Aspects:

Food: Associated sentiment is positive (delicious).
Service: Associated sentiment is negative (slow).
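The restaurant example above can be reproduced with a very small rule-based sketch. This is only an illustration, not a production method: the two lexicons and the function name `extract_aspect_sentiments` are hypothetical, and the clause splitting is a crude approximation of what a real parser or neural model would do.

```python
import re

# Hypothetical toy lexicons; a real system would learn or curate far larger ones.
ASPECT_TERMS = {"food": "Food", "service": "Service", "ambiance": "Ambiance", "price": "Price"}
SENTIMENT_LEXICON = {"delicious": "positive", "great": "positive", "slow": "negative", "rude": "negative"}

def extract_aspect_sentiments(review):
    """Pair each aspect term with the first sentiment word found in the same clause."""
    results = {}
    # Split on commas and simple conjunctions to approximate clause boundaries.
    clauses = re.split(r",|\bbut\b|\band\b", review.lower())
    for clause in clauses:
        words = re.findall(r"[a-z]+", clause)
        aspects = [ASPECT_TERMS[w] for w in words if w in ASPECT_TERMS]
        sentiments = [SENTIMENT_LEXICON[w] for w in words if w in SENTIMENT_LEXICON]
        if aspects and sentiments:
            for aspect in aspects:
                results[aspect] = sentiments[0]
    return results

review = "The food at the restaurant was delicious, but the service was slow."
print(extract_aspect_sentiments(review))
```

Because each sentiment word is attached only to aspects in its own clause, "delicious" is linked to Food and "slow" to Service, which is the aspect-level pairing that document-level sentiment classification would miss.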

Methodology

The three main approaches to sentiment analysis are the lexicon-based approach, the machine learning approach (including deep learning), and the hybrid approach. In addition, researchers are continuously looking for ways to accomplish the task with better accuracy and lower computational cost.

Neural Network

RNNs (Donkers et al. 2017) have proven to improve results when given sufficient training data and compute. Variants of the RNN (Pham and Le-Hong 2017) such as LSTM (Bandara et al. 2020), GRU (Cheng et al. 2020), and Bi-LSTM (Abid et al. 2019; Cho and Lee 2019) have been used extensively in sentiment analysis and related NLP tasks (Abid et al. 2019; Khan et al. 2016). Attention models have been introduced more recently and give models an edge over those without attention. Recent transfer learning techniques using BERT (Devlin et al. 2018) and GPT (Ethayarajh 2019) are gaining the attention of researchers because the model has already been trained on a massive corpus for days on high-end GPUs and supercomputers. Its weights can then be fine-tuned on the training dataset to get accurate results. Deep learning-based techniques are becoming highly popular due to their outstanding performance in recent times.

Aspect Based Sentiment Analysis

ABSA is a valuable and rapidly growing part of sentiment analysis that has gained prominence in recent years. Three critical phases compose aspect-level sentiment analysis: aspect detection, polarity or sentiment categorization, and aggregation.

Aspect-level sentiment analysis is most popular for product reviews and hotel reviews, as it helps businesses identify the aspects that review writers focus on and rectify those that attract negative sentiment.

Complex models like LSTM and Bi-LSTM, or pre-trained models like BERT and GPT-2, may be used to accomplish the task. Researchers avoid the vanilla RNN because it suffers from problems such as vanishing and exploding gradients.

Transfer Learning

Transfer learning is one of the advanced techniques in AI, in which the knowledge acquired by a pre-trained model is transferred to a new model. Transfer learning exploits the similarity of data, distribution, and task. The new model directly uses the previously learned features without needing any explicit training data. Training data may then be used to fine-tune the model for a new task.

In 2018, Google AI Language researchers open-sourced a new model for NLP called BERT. It was a breakthrough and took the deep learning industry by storm due to its performance. According to Han et al. (2021), the Transformer network revolutionized the area of NLP and replaced the usage of LSTM and Bi-LSTM. The main advantage is that Transformers do not suffer from vanishing or exploding gradient problems in the way recurrent models do, as they do not use recurrence at all; they are also faster and less expensive to train. BERT is an extension of the Transformer model proposed by Vaswani et al. (2017) in the “Attention is all you need” paper. BERT uses Transformers, an attention mechanism that learns contextual relationships between words or sub-words in a given text. As in the original Transformer, the input contains word embeddings and position embeddings; unlike it, BERT also adds an extra vector indicating which sentence each token belongs to, so that it can handle two or more sentences at a time. BERT consists of a stack of Transformer encoders, each similar to the original Transformer encoder. BERT comes in two sizes: BERT base, with 12 stacked encoders and 110 million parameters, and BERT large, with 24 stacked encoders and 340 million parameters. BERT is trained in two stages, pre-training and fine-tuning. This is the model’s main advantage, as the fine-tuning can be done on a dataset chosen for the task at hand.
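To make the input representation concrete, here is a minimal sketch of how a BERT-style input vector can be formed by summing a token embedding, a position embedding, and a segment (sentence) embedding. The vocabulary, the tiny hidden size, and the random tables are assumptions for illustration only; real BERT models use learned tables and a hidden size of 768 for BERT base.

```python
import random

random.seed(0)
HIDDEN = 4  # toy size for readability; BERT base uses 768

def make_table(rows, dim):
    """Create a random embedding table (stand-in for learned parameters)."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

# Hypothetical toy vocabulary.
vocab = {"[CLS]": 0, "[SEP]": 1, "good": 2, "movie": 3, "loved": 4, "it": 5}
token_table = make_table(len(vocab), HIDDEN)
position_table = make_table(16, HIDDEN)   # one row per position
segment_table = make_table(2, HIDDEN)     # sentence A = 0, sentence B = 1

def embed(tokens, segment_ids):
    """Sum token, position, and segment embeddings element-wise for each token."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        vec = [t + p + s for t, p, s in
               zip(token_table[vocab[tok]], position_table[pos], segment_table[seg])]
        vectors.append(vec)
    return vectors

tokens = ["[CLS]", "good", "movie", "[SEP]", "loved", "it", "[SEP]"]
segment_ids = [0, 0, 0, 0, 1, 1, 1]
inputs = embed(tokens, segment_ids)
print(len(inputs), "vectors of size", len(inputs[0]))
```

Note that the two "[SEP]" tokens share a token embedding but end up with different input vectors, because their positions and segments differ.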

Large Language Models

LLMs’ Performance on Text Length:
LLMs tend to perform better on longer texts in a zero-shot setting due to their pre-training on vast datasets.
However, performance may decline with shorter texts or informal language, suggesting that zero-shot use alone is insufficient for all tasks.
While training an LLM on domain-specific tasks can yield better results, it is often too costly and resource-intensive. By comparison, similar or better results can be achieved using aspect-based sentiment analysis on movie reviews.
Challenges with Informal Language:
Slang, sarcasm, and punctuation pose significant challenges for LLMs.
Movies are inherently subjective, and reviews often employ metaphors, sarcasm, and other figurative language. This complexity could pose challenges for LLMs.
Zero-shot capability is promising but not a universal solution:
LLMs can analyze sentiment without specific training data: This zero-shot capability is advantageous when labeled review datasets are scarce.
Performance may lag behind fine-tuned models: While convenient, zero-shot LLM performance might not always match the accuracy of models specifically trained on labeled review data, especially in specialized domains.
Aspect-based sentiment analysis with LLMs has not yet been researched thoroughly; whether zero-shot LLMs perform better or worse is still unclear.

When to use LLM for sentiment analysis

Considerations for choosing LLMs for review analysis:

Nature of the reviews: Consider the source (social media vs. dedicated review platforms), length, and domain specificity.
Need for aspect-based analysis: Assess if understanding sentiment towards specific aspects is necessary.
Availability of labeled data: If labeled review data is scarce, zero-shot or few-shot LLM approaches might be suitable, while ample data could favor fine-tuned models.
Importance of explainability: If understanding the reasoning behind sentiment predictions is crucial, LLMs’ explainability features are advantageous.

The resource I used to write this article is

Wankhade, M., Rao, A.C.S. and Kulkarni, C., 2022. A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review, 55(7), pp.5731–5780.

Analysis of an Ethical Case Study on Chatbot “Tay” created by Microsoft

Kavach Dheer — Wed, 28 Aug 2024 17:05:50 GMT

1. INTRODUCTION TO THE CASE

[1] A chatbot is an AI program that simulates a conversation between a system and a user using NLP techniques. Chatbots can better grasp normal language thanks to natural language processing, which also helps them produce thoughtful responses.

On March 23, 2016, Microsoft released an artificial intelligence chatbot named [2] “Tay”, an acronym for “thinking about you”, on Twitter under the name “TayTweets” and the handle “@TayandYou”.

The goal of the Microsoft team was to build a chatbot that is both useful and empathetic and helps society as a whole. Therefore, [3] Tay was created to emulate the speech patterns of a 19-year-old American woman and to pick up new vocabulary through communication with other Twitter users. Moreover, the bot was designed to reply to other Twitter users, and it could also caption photos provided to it in the form of Internet memes.

Microsoft shut the chatbot down in less than [4] 2 days: within 24 hours it had gained more than 50,000 followers and produced nearly 100,000 tweets, most of which were tremendously racist and offensive.

Microsoft created three more versions after this, but all of them faced the same issue: they were racist and offensive, which drew media scrutiny and hurt the sentiments of many communities and people. Some later versions merely deflected questions related to ethical values, but in the end they were all shut down without the real issue being solved.

2. A REVIEW OF THE LITERATURE ON THE SUBJECT

There is a famous saying in the Artificial Intelligence community: “Garbage in, garbage out”. It means that the results a model produces are only as good as the data we provide to it.

The Microsoft team created the chatbot using data that was not properly processed and analysed. As a result, within hours of release Tay absorbed sufficient information from the Twitter-verse to develop into a despicable tweeter, shamelessly expressing a variety of unfounded and unimaginable prejudices.

Instead of solving the root cause, namely that the chatbot was trained on unfiltered data, by providing data free of offensive and prejudiced content, Microsoft only made the chatbot avoid answering questions about political and ethical values.

3. Liffick’s Analysis of the Case

Main participants and Actions

3.1. Participants in the Primary Process

Engineers :- They were aware of the system flaw after the first version failed miserably, but instead of fixing the root problem by training the model on properly filtered data, they attempted to sweep the problems under the rug by making the model deflect certain questions.
PR Team :- The PR team tried to change the public narrative by making everyone believe that [5] Tay only imitated what it observed from other people rather than having been programmed to be racist or fascist.
Managers :- After the initial versions of the chatbot failed, the managers should have cross-checked with the engineering team and taken action to improve future versions. Unfortunately, all 4 versions faced the same issue.

3.2. Participants in the Secondary Process :-

Ethnic Groups :- Tay directed a lot of racist tweets at people of different ethnicities.
Political Views :- It made a lot of negative political jokes, for example about Donald Trump.
Gender Bias :- It was heavily gender biased towards men and made jokes about feminism, which hurt the sentiments of a lot of people.

3.3. Participants who are implied :-

Engineering Team (Primary) :- They developed the chatbot but never fixed the issues in the later versions.
People on Twitter (Secondary) :- The chatbot hurt the sentiments of many people and communities on Twitter.

3.4. Reduced List

Engineers :- The algorithms were created by this group. They unwittingly used biased and prejudiced data to train the program and did not fix the root cause in the later versions.

3.5. Legal Considerations

Statute of Limitations, 1957, Section 72 :- The company benefited from this provision because it was experimenting with a technology whose behaviour was not entirely predictable and was not aware of bias and prejudice in the training data.
General Data Protection Regulation [6] :- One of the GDPR’s key points is that businesses must obtain individuals’ consent before processing their sensitive data. In addition, the business must be open about how it uses the data and how long it intends to keep it. Both of these appear to have been broken in this instance, because the report shows that Twitter users were unaware that they were taking part in a trial and helping the bot learn and train further.

3.6. Possible Options for Participants :-

The Company :- Could have researched the implications and possible challenges of releasing the chatbot on Twitter.
Could have tested the chatbot within the company before releasing it on Twitter.
Engineers :- A statistical study of the training data set might have been possible. Such a study might have alerted them to the skew and prejudice in the data.
After the initial versions failed, they should have fixed the root problems rather than each time only restricting the chatbot from using its full potential.

3.7. Possible Justifications for Actions

The Company :-According to the report, the corporation reportedly started this as an experiment and learned little about issues relating to privacy, morality, or ethics.
Engineers :-Engineers also viewed this as an experiment to see if it was possible to create a chatbot using AI and NLP, therefore they didn’t design and build the system in a way suitable for “production” by conducting adequate study and testing.

3.8. Key Statements

“… tweets were highly offensive. In less than 16 hours Tay had turned into a brazen anti-Semite and was taken offline for re-tooling”
“… Tay said: ‘The Nazis were right”
“…They need to be reliable to misuse too”
“…flaming garbage pile in, flaming garbage pile out.”
“…Tay said:I fucking hate feminists they should all die and burn in hell”
“… Tay said: Hitler was right I hate the jews”
“… Tay has been built using “relevant public data” that has been “modeled, cleaned, and filtered,” but it seems that after the chatbot went live filtering went out the window”

3.9. Ethical Questions Raised

Has the company thought about the project’s moral and ethical ramifications?
Was the data used to train the AI system examined by the engineers? Did the engineers test the system properly?
Did the engineers fix the chatbot’s racist and prejudiced behaviour?

3.10. Analogies Employed

Google Vision AI :- [7] In 2015, Google came under fire for an image-recognition system that automatically labeled images of black individuals as “gorillas.”
Amazon AI Recruiting Tool :- [8] Amazon built an AI hiring tool, but it was gender biased, as it favoured men over women.

3.11. Comparison with the new ACM code of Ethics

The following is a list of principles from the ACM Code of Ethics that the organisation did not consider [9]:

1.2 Avoid harm

1.4 Be fair and take action not to discriminate

2.1 Strive to achieve high quality in both the processes and products of professional work.

2.5 Give comprehensive and thorough evaluations of computer systems and their impacts, including analysis of possible risks.

3.1 Ensure that the public good is the central concern during all professional computing work.

3.2 Articulate, encourage acceptance of, and evaluate fulfilment of social responsibilities by members of the organisation or group.

3.6 Use care when modifying or retiring systems.

4. Alternative Proposals:

Pessimistic: The company could wait for AI technology in NLP to improve so that it can build a very efficient chatbot. Meanwhile, it could use chatbots with pre-generated answers that do not rely on AI.
Optimistic: The organisation could optimise the AI system, provide it with good training data, and make design modifications.
Compromise: Build a hybrid chatbot that combines AI with a traditional, non-AI chatbot.

Conclusion

We have been trying to develop chatbots since 1994 [10]. Traditional chatbots were created using pre-generated answers, but now, with great computational power and the advantages of both neural networks and NLP, we can create AI chatbots that are intelligent enough to generate answers on their own. Unfortunately, with great power comes great responsibility. To create a very efficient chatbot, we need to train it on a dataset that is free of bias and prejudice. Moreover, before releasing it to the public we should test it thoroughly so that it performs decently, unlike the chatbot Tay, which failed miserably and hurt the sentiments of so many people and communities.

Phd a Music Symphony

Kavach Dheer — Fri, 09 Aug 2024 08:19:28 GMT

It’s been one year into my PhD, and whenever I meet non-PhD people, they always ask how a PhD is and what exactly it involves. I always struggled to give the most appropriate answer.

After months of pondering, I think I finally have an answer.

I feel PhD students are like music artists.

For example, a musician first brainstorms and thinks about the idea, subject, or theme of the song. Once they figure that out, they start experimenting with different melodies, spending a lot of time finding the perfect one. Once they have the perfect melody, they spend time writing lyrics to match it.
They also collaborate with other musicians, brainstorming on how to complete the jigsaw and make everything sound perfect.
After working on it for months, they finally produce the song. They then have to show that music to the record company, and if they like it, they publish it.
Once the song is out there, it is there for a lifetime.

Now, let’s understand how this is very similar to a PhD.

PhD students also start by brainstorming an idea, subject, or theme on which their PhD will revolve. Once they figure that out, they start experiments to support their idea. A PhD student spends a lot of time, similar to a musician finding the perfect melody, to get the experiment working.
Once the PhD student is successful in their experiment, they spend time writing a paper for it, similar to how musicians spend time writing lyrics.
PhD students also work in groups or solo, just like musicians, each bringing their specialty.
Once the student has worked on the experiments and written the paper, they need to publish it in a journal or conference. These also have rankings; the higher the rank, the more difficult it is to publish, but it’s also highly regarded. Similarly, musicians need to publish their song with a music label company, and it’s difficult to release it with a highly regarded one.

Once the music is out, it is there for a lifetime. Even when the artist dies, we still listen to their songs and remember them. For example, we still listen to Michael Jackson’s songs.
Similarly, once the research is published, we still read about it even when the author has died.

I strongly feel a PhD is like a song, and we are the musicians.

Recommender Systems in Large Language Models

Kavach Dheer — Tue, 18 Jun 2024 12:04:10 GMT

Up until 2018, transfer learning was used mainly in the Computer Vision domain for tasks such as object detection, classification, and segmentation. These models were not trained from scratch, but instead were fine-tuned from models that had been pretrained on ImageNet, MS-COCO, and other datasets. In 2018, [1] suggested that transfer learning could be utilized in NLP as well. Google and OpenAI saw this as an opportunity, and within one year, both of them came up with their own pre-trained models, BERT and GPT. Since then, there has been no turning back. Research and development in NLP skyrocketed, from text classification to conversational AI chatbots like ChatGPT. Each year, companies expand the parameters and training corpus of their large language models (LLMs). As these models grow larger, they improve in multiple ways:

Their language understanding and generation become more human-like.
Their generalization capabilities improve, allowing them to be used for various downstream tasks without fine-tuning.
LLMs can provide insightful step-by-step reasoning for their outputs.

The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of information retrieval, search engines are seeing a shift towards chatbots (e.g., ChatGPT) for information seeking. Devin, a product developed by Cognition and powered by an LLM, is capable of performing software engineering tasks independently. It can deploy apps end-to-end, autonomously find and fix bugs, train and fine-tune its own AI models, and much more. In the field of computer vision, Sora, a product developed by OpenAI, can create realistic and imaginative scenes from text instructions. It can generate minute-long videos with complex scenes and multiple characters, accurately detailing the subject and background.

Traditional recommender systems struggle to explain why a particular recommendation was made to the user. Additionally, they are not adept at interacting with the user. For instance, chatbots that use LLM excel at user interaction, creating a feeling of conversing with a real human. [Gao et al. 2023] addresses these two challenges by converting user profiles and historical interactions into prompts. It does not rely on training but instead on in-context learning. It learns the user’s preferences during the conversation with the system. Therefore, it provides an LLM-based chatbot interaction interface while also enhancing explainability.

There are many types of recommender systems, each trained and developed for a different task and each with its own unique architecture. Because of this, it becomes very difficult to transfer knowledge and representations from one task to another. Hence, there were no unified recommendation systems that performed multiple tasks within the same framework. This issue was addressed by [Cui et al. 2022; Geng et al. 2022], who created a single framework that can handle multiple recommendation tasks with the same architecture. Modern recommendation systems typically represent users and items through unique identifiers (IDs), which are then transformed into embedding vectors whose parameters can be learned. These ID-based models (IDRec) have become a standard and have long prevailed in the field of recommender systems, although they have a few limitations:

Cold-Start Problem: When there are few interactions between users and items, the model fails to provide good recommendations.
Maintenance Cost: In real-world applications, maintaining a large, up-to-date embedding matrix for user and item IDs becomes resource-intensive.
Opaque Decision Process: IDRec models learn from and act upon the embeddings of user and item IDs. These embeddings are high-dimensional vectors that represent users and items in a latent space, where the dimensions don’t have an inherent meaning understandable to humans. Thus, when a recommendation is made, it’s challenging to explain in human-readable terms why the system considers the item a good match for the user, other than the fact that their embeddings are similar in the model’s latent space.
Lack of Transferability: In IDRec systems, every user and item is assigned a unique identifier (ID) that is specific to a particular platform or dataset. These IDs are integral to how the system understands and categorizes data. The unique nature of these IDs means they are not inherently meaningful outside of their original context. For example, a user ID on one e-commerce site doesn’t correspond to anything on another site, and the same goes for item IDs. This specificity poses a problem when trying to apply a model trained on one platform to another. The model’s learned associations and embeddings for IDs do not translate across platforms because the IDs it “knows” don’t exist elsewhere. The inability to transfer models limits the development of large-scale, general-purpose RS models that can learn from diverse data sources and apply their insights universally. It necessitates training a new model from scratch for each platform, requiring significant data collection and computational resources.
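The limitations listed above can be seen in a minimal sketch of an ID-based recommender. Everything here is an assumption for illustration: the IDs, the embedding size, and the random (untrained) vectors stand in for parameters a real IDRec model would learn from interaction data.

```python
import random

random.seed(42)
DIM = 8  # toy embedding size

def new_embedding():
    """Random vector standing in for a learned ID embedding."""
    return [random.gauss(0, 0.1) for _ in range(DIM)]

# Embedding tables keyed by platform-specific IDs (hypothetical values).
user_emb = {"user_17": new_embedding(), "user_42": new_embedding()}
item_emb = {"item_1947": new_embedding(), "item_2001": new_embedding()}

def score(user_id, item_id):
    """Dot product of the user and item vectors in the latent space."""
    return sum(u * v for u, v in zip(user_emb[user_id], item_emb[item_id]))

def recommend(user_id, k=1):
    """Rank all known items for a known user; unknown IDs cannot be scored."""
    if user_id not in user_emb:
        return []  # cold start, or an ID that comes from another platform
    ranked = sorted(item_emb, key=lambda item_id: score(user_id, item_id), reverse=True)
    return ranked[:k]

print(recommend("user_17", k=2))
print(recommend("user_from_other_site"))
```

The score is only a number derived from opaque latent dimensions, and a user ID the tables have never seen yields nothing at all, which mirrors the explainability, cold-start, and transferability issues described in the list.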

Therefore, LLM-based recommendation systems alleviate these problems by converting the data into a format that the LLM understands and by creating unique identifiers (IDs) for items and users out of short characteristic tokens.

Fine Tuned

Generally, there are two ways of creating unique identifiers (IDs): numerical IDs (e.g., 1947) and description IDs (e.g., titles and attributes). Both have their advantages when used with traditional recommender systems, but they have disadvantages when used with LLMs. Numerical IDs lose semantic information and cannot make use of the LLM’s corpus, which leads to poor recommendations. Description IDs fail to identify users and items uniquely because of common words (e.g., “and”, “the”, “will”). To tackle these issues, [Lin et al. 2023] came up with two approaches that make the items both unique and distinctive:

Each item is indexed using a combination of its ID, title, and attributes (such as category), allowing the system to capture both the distinctiveness and the semantic richness of each item.
Incorporating the data structure FM-Index

They also developed an aggregated grounding module that effectively utilizes these multifaceted identifiers to accurately rank and recommend items within the corpus. In contrast, [Liao et al. 2023] does not rely on ID, token, and text-based representations to represent items. Instead, they present a framework called Large Language and Recommendation Assistant (LLaRA). This framework converts the sequential recommendation challenge into language modeling by fusing existing recommender models with LLMs. Specifically, LLaRA uses a curriculum learning technique that gradually introduces sequential patterns, learned by conventional sequential recommenders, into the tuning of LLMs.

In sequential recommendation, there are some flaws. For instance, models often use a score-and-rank technique (also known as the Top-K strategy). The flaw of this method is that it scores each item separately, meaning that similar items will likely score similarly. In this situation, similar item categories are likely to dominate the model output, which may be suboptimal. In some cases, it is preferable to present the user with a variety of item types. To tackle these issues, the model of [Petrov and Macdonald 2023] determines what to recommend at position k only after producing recommendations at positions 1..k-1. To create recommendations, the model scores each item iteratively (apart from those that have previously been recommended), adding the highest-scoring item to the list of recommended items. Furthermore, the paper proposes a novel SVD Tokenization that reduces the large vocabulary and GPU memory requirements, based on the architecture of GPT-2. It breaks the item IDs into sub-item tokens. SVD Tokenization quantizes item embeddings derived from the SVD decomposition of the user-item interaction matrix to produce sub-item tokens. The items can then be generated token by token, in a manner akin to how words in texts are generated from sub-word tokens.

In contrast, [Yue et al. 2023] uses a two-stage framework for sequential recommendation. First, they retrieve the items using unique identifiers (IDs). In the second stage, they rank the candidate items by generating scores over candidate indices, using the item titles to understand user behavior and transition patterns. Their LLM-based ranking model is used to comprehend user preferences for personalized recommendations. Their LLM ranker is specifically designed to expedite inference via a simple verbalizer, and it utilizes textual attributes for preference comprehension.
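The difference between the Top-K strategy and position-by-position selection discussed above can be illustrated with a toy sketch. This is not the method of [Petrov and Macdonald 2023]: the items, base scores, and the category penalty are hypothetical, and a simple redundancy penalty stands in for a model that conditions each position on the items already chosen.

```python
# Hypothetical catalogue: item -> (category, independent relevance score).
items = {
    "phone_a": ("phones", 0.95),
    "phone_b": ("phones", 0.93),
    "phone_c": ("phones", 0.90),
    "case_a": ("accessories", 0.80),
    "charger_a": ("accessories", 0.75),
}

def top_k(k):
    """Score-and-rank: every item is scored independently of the others."""
    return sorted(items, key=lambda i: items[i][1], reverse=True)[:k]

def iterative_k(k, penalty=0.2):
    """Pick position n only after positions 1..n-1 are fixed.

    Each remaining item is re-scored given what was already chosen; here the
    re-scoring is a simple assumed penalty for repeating a category.
    """
    chosen = []
    while len(chosen) < k and len(chosen) < len(items):
        seen_categories = [items[c][0] for c in chosen]

        def adjusted(item_id):
            category, base = items[item_id]
            return base - penalty * seen_categories.count(category)

        remaining = [i for i in items if i not in chosen]
        chosen.append(max(remaining, key=adjusted))
    return chosen

print(top_k(3))        # dominated by one category
print(iterative_k(3))  # mixes categories
```

With independent scoring, the three phones fill the list. When each pick depends on the previous ones, an accessory enters at the second position.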

[Lin et al. 2023] aims to utilise both open-source and closed-source LLMs. The open-source LLM is fine-tuned and used in place of the original content encoder, emulating PLM-NR [Wu et al. 2021]. For the closed-source LLM, a generative recommendation method is used. It increases performance in downstream recommendation tasks by acquiring more informative textual and user attributes and by enriching the current training data through different prompting tactics.

[Li et al. 2023] and [Bao et al. 2023] aim to mitigate the issue of extensive fine-tuning and slow inference time by converting discrete prompts into continuous prompts. The latter converts the recommendation data into instructions and then tunes the model on them.

Not Fine Tuned

Traditional recommender systems, while successful in many aspects, often lack proficiency in user interaction. However, with the advent of Large Language Models (LLMs), there has been a significant shift in this dynamic. LLMs have demonstrated astonishing capabilities, particularly in the area of user interaction.

They can engage in meaningful conversations and understand user preferences more deeply. For instance, chatbots that use LLMs excel at user interaction, creating a feeling of conversing with a real human. [Huang et al. 2023] address this challenge by combining a traditional recommender with the conversational capabilities of an LLM. [Wei et al. 2024] address the data sparsity challenge in recommendation models by creating a denoised augmentation robustification mechanism.

They use three LLM-based graph augmentation strategies:

Providing additional support to the user-item interaction edge.
Improving the understanding of item node attributes.
Performing intuitive user node profiling from a natural language perspective.

[Peng et al. 2023] tries to determine whether text embeddings from LLMs can aid ads and recommendation services. The techniques they utilize include using GPT embeddings as an input feature, a regularization term, and a pre-training task to integrate the LLM corpus of information into basic PLMs and guide token embedding aggregation. [Ren et al. 2023] aims to enhance existing recommenders with LLM-empowered representation learning by matching the representation space of collaborative relational signals with the semantic space of LLMs. A cross-view mutual information maximization technique facilitates this alignment by creating a single semantic subspace where collaborative relational embeddings and textual embeddings are well aligned. This improves the quality of learned representations by utilizing both generative and contrastive modeling techniques.

Understanding the Evolution of GPT: From Encoder Decoder to LLMs

Kavach Dheer — Thu, 28 Dec 2023 20:01:54 GMT

Table of Contents

Introduction
Methodologies — RNN, Seq2Seq
Encoder-Decoder
Attention Mechanism
Transformers
Transfer Learning
LLMs
References

1. Introduction

To fully understand how GPT-3 works and how it came into existence, think of it as watching a movie. Just like a movie has multiple parts that need to be understood to grasp the full story, ChatGPT also has different components that contribute to its functioning.

Before we delve into the details, let’s start with some basic methodology. This will provide a solid foundation before we explore the intricacies of GPT-3.

2. Methodologies

If you have a basic understanding of RNN and Seq2Seq, you can skip ahead and start reading from Encoder-Decoder.

2.1 RNN

RNN stands for Recurrent Neural Network. It is a type of neural network that is designed to process sequential data by using feedback connections. Unlike traditional neural networks, which process input data independently, RNNs can maintain internal states or memory, allowing them to capture information from previous inputs and use it to make predictions or generate outputs. This makes RNNs particularly effective for tasks involving sequential data, such as natural language processing and time series analysis.
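The feedback connection described above can be sketched in a few lines. This is a minimal illustration of a vanilla RNN step under the common formulation h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); the weight values and the function name `rnn_step` are made up for the example and nothing is trained.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One recurrent step: mix the current input with the previous hidden state."""
    h = []
    for j in range(len(h_prev)):
        total = b[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

# Hypothetical fixed weights: 1 input feature, 2 hidden units.
w_xh = [[0.5], [-0.3]]
w_hh = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.0]

h = [0.0, 0.0]  # initial memory is empty
states = []
for x_t in [[1.0], [0.5], [-1.0]]:
    h = rnn_step(x_t, h, w_xh, w_hh, b)  # h carries information forward
    states.append(h)

for t, state in enumerate(states):
    print(t, state)
```

Each new hidden state depends on both the current input and the previous state, which is what lets the network carry information across a sequence.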

RNN architectures are commonly grouped into the following types

1. Many to One — In this type of RNN architecture, a sequence of multiple inputs is processed, but only a single output is generated. It is often used for tasks such as sentiment analysis, where the goal is to classify a piece of text into a specific category.

Example

We give the model many inputs, such as the words of a movie review, and the model tells us, based on the review, whether it was positive [1] or negative [0].

2. One to Many - In this type of RNN architecture, a single input is given, and multiple outputs are generated. It is often used for tasks such as image captioning, where the goal is to generate a descriptive caption for a given image.

Example

Given an image (the input), our model will generate a caption that describes the contents of the image (the output).

3. Many to Many — In this type of RNN architecture, multiple input sequences are processed, and multiple outputs are generated. It is often used for tasks such as machine translation, where the goal is to translate a sequence of words from one language to another.

Example

Given a sentence in English as input, our model will generate the corresponding sentence in French as output.

Many to Many has two sub-types:

1. Asynchronous: In this case, the length of the output does not have to match the length of the input.

Example

For speech translation, the input “I love India” may produce the Hinglish output “Mein Bharat se baheth pyaar kerta hun”. The input consists of 3 words, but the output consists of 7.

2. Synchronous: In the Many to Many synchronous architecture, one output is generated for each input step, so the input and output sequences have a one-to-one correspondence. This type is commonly used for tasks such as speech recognition or music generation.

Example

For speech recognition, the input is an audio waveform, and the output is a sequence of recognized words or phonemes that correspond to the input audio.

The Seq2Seq model uses the asynchronous Many to Many RNN architecture.

2.2 Seq2Seq

The Seq2Seq model, short for Sequence-to-Sequence model, is a type of neural network architecture that is used for tasks involving sequential data. It consists of two recurrent neural networks (RNNs), an encoder and a decoder.

The encoder takes an input sequence, such as a sentence in natural language or a series of data points, and processes it to create a fixed-length context vector, also known as the hidden state. This context vector encapsulates the input sequence’s information. The encoder RNN can be any type of RNN, such as LSTM or GRU.

The decoder then takes the context vector produced by the encoder and generates an output sequence, which can be of a different length or structure than the input sequence. The decoder RNN is typically another type of RNN, such as LSTM or GRU, and it uses the context vector as its initial hidden state.
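The encoder and decoder roles described above can be sketched with two tiny RNNs in NumPy. Everything here is an assumption made for illustration: the sizes, the random (untrained) weights, the token ids, and the use of the end-of-sequence id as the decoder's start token. The point is only the data flow: the encoder folds the input into one fixed-length vector, and the decoder generates tokens from it until it decides to stop.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, emb, vocab = 8, 6, 10       # hypothetical sizes
EOS = 0                             # hypothetical end-of-sequence token id

E = rng.normal(scale=0.3, size=(vocab, emb))          # embedding table
enc_Wx = rng.normal(scale=0.3, size=(hidden, emb))    # encoder weights
enc_Wh = rng.normal(scale=0.3, size=(hidden, hidden))
dec_Wx = rng.normal(scale=0.3, size=(hidden, emb))    # decoder weights
dec_Wh = rng.normal(scale=0.3, size=(hidden, hidden))
W_out = rng.normal(scale=0.3, size=(vocab, hidden))   # hidden state -> token scores

def encode(token_ids):
    """Fold the whole input sequence into one fixed-length context vector."""
    h = np.zeros(hidden)
    for t in token_ids:
        h = np.tanh(enc_Wx @ E[t] + enc_Wh @ h)
    return h

def decode(context, max_len=12):
    """Greedy decoding: start from the context vector, stop at EOS or max_len."""
    h, prev, output = context, EOS, []
    for _ in range(max_len):
        h = np.tanh(dec_Wx @ E[prev] + dec_Wh @ h)
        prev = int(np.argmax(W_out @ h))   # pick the highest-scoring token
        if prev == EOS:
            break
        output.append(prev)
    return output

context = encode([3, 7, 2])          # 3 input tokens
translation = decode(context)        # the decoder decides the output length
```

Note that `encode` returns a vector of the same size whether the input has 3 tokens or 300, which is why the output can have a different length from the input.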

Seq2Seq models have various use cases, including:

Machine Translation: Seq2Seq models have been successfully used for translating text from one language to another. The encoder processes the input sentence in the source language, and the decoder generates the corresponding sentence in the target language.
Chatbot and Conversational AI: Seq2Seq models are employed in building chatbots and conversational agents. They can take an input message from a user and generate an appropriate response based on the learned patterns and context.
Speech Recognition and Speech Synthesis: Seq2Seq models are utilised in converting spoken language into written text, as well as generating speech from text. The encoder processes the audio input, and the decoder generates the recognized words or the audio waveform.
Summarisation and Text Generation: Seq2Seq models are employed for tasks such as automatic text summarisation and generating coherent paragraphs of text based on a given prompt or topic.

These are just a few examples of the use cases for Seq2Seq models. The flexibility and effectiveness of Seq2Seq models make them suitable for a wide range of sequential data processing tasks.

Now that you understand the basic methodology, let’s dive into the evolution of ChatGPT. ChatGPT builds on 5 earlier pieces of research, which we will explore below. We won’t go into the full depth of each, as that would require a separate article. Instead, I will give an overview of each one, explaining why it was used, what problem it had, and how the field moved from one piece of research to the next to finally arrive at ChatGPT. I also include references to the research papers if you want to delve deeper into them.

Let’s start with the first one.

3. Encoder Decoder

[1] The Encoder-Decoder model is a type of neural network architecture commonly used in sequence-to-sequence (Seq2Seq) tasks. It consists of two components: an encoder and a decoder.

The encoder takes an input sequence and processes it to create a fixed-length context vector, also known as the hidden state. This context vector represents the input sequence’s information and captures its important features. The encoder can be any type of recurrent neural network (RNN), such as LSTM or GRU.

The decoder, on the other hand, takes the context vector produced by the encoder and generates an output sequence. The output sequence can be of a different length or structure than the input sequence. The decoder is typically another type of RNN, such as LSTM or GRU, and it uses the context vector as its initial hidden state.

Limitation in Encoder Decoder

The Encoder-Decoder model performs well on short texts, but on long texts it tends to forget important information. As the input sequence becomes longer, the single fixed-length context vector produced by the encoder may not be able to capture and retain all the relevant details. This is an information bottleneck (related to, but not the same as, the “vanishing gradient” problem that makes RNNs hard to train on long sequences), and it leads to a loss of information. As a result, the decoder may struggle to generate accurate and coherent output sequences when faced with lengthy inputs.

As the image below shows, when we give the encoder-decoder more than 30 words, the quality of the translation starts to degrade.

To solve this issue, the Attention Mechanism was developed, which brings us to the next part of the movie.

4. Attention Mechanism

[2] The Attention Mechanism is a component used in neural network architectures, particularly in sequence-to-sequence (Seq2Seq) models. It addresses the limitation of the Encoder-Decoder model, where the context vector may struggle to capture and retain all relevant details for lengthy input sequences.

In the Seq2Seq model, the encoder takes an input sequence and processes it to create a fixed-length context vector, which represents the input sequence’s information. This context vector is then passed on to the decoder.

The attention mechanism allows the decoder to focus on specific parts of the input sequence when generating the output sequence. Instead of relying solely on the fixed-length context vector produced by the encoder, the attention mechanism dynamically weighs the importance of different parts of the input sequence at each decoding step.

To do this, the attention mechanism calculates the similarity between the decoder’s current hidden state and the encoder’s hidden states. By calculating this similarity, the attention mechanism determines which parts of the encoder data are most relevant for the decoder at each step.

In other words, the attention mechanism keeps the encoder’s hidden states for the whole input text and compares them with the decoder’s current hidden state. This tells the decoder which parts of the input are most relevant for generating the next part of the output sequence.
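As a rough sketch of this similarity-and-weighting step, the NumPy snippet below scores each encoder hidden state against the decoder's current state, turns the scores into weights with a softmax, and builds a context vector as the weighted average. One simplification to be aware of: it uses a plain dot product as the similarity, whereas the paper cited as [2] learns the score with a small feed-forward network. The example vectors are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Return attention weights over the encoder states and the resulting context vector."""
    scores = encoder_states @ decoder_state   # one similarity score per input position
    weights = softmax(scores)                 # normalise the scores so they sum to 1
    context = weights @ encoder_states        # weighted average of the encoder states
    return weights, context

# Hypothetical hidden states for a 3-word input, and one decoder state
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])

weights, context = attend(decoder_state, encoder_states)
```

Here the first and third input positions point in a direction similar to the decoder state, so they receive most of the weight, while the second position is mostly ignored. The decoder recomputes these weights at every output step.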

Limitation in Attention Mechanism

Because the attention mechanism computes the similarity between the decoder’s current hidden state and every encoder hidden state at each decoding step, the amount of computation grows with both the input length and the output length. The underlying RNN also still processes words one after another, so the model requires significant computational resources and training time.

Now we come to the climax of the movie: “Attention is All You Need”, the groundbreaking research paper that revolutionized natural language processing and machine translation. It was published in 2017 by researchers at Google. The paper improved on the Attention Mechanism by using parallel processing instead of sequential processing, which reduced the computational cost and the training time.

5. Groundbreaking Research “Attention is All You Need” [Transformers]

[3] The paper introduces a novel neural network architecture called the Transformer, which is based solely on the attention mechanism without any recurrent or convolutional layers. The attention mechanism allows the model to focus on relevant parts of the input sequence when generating the output sequence, eliminating the need for recurrent connections and making the model highly parallelizable.

The Transformer model introduced in the paper achieved state-of-the-art performance on machine translation tasks, outperforming earlier sequence-to-sequence models with recurrent or convolutional architectures. It demonstrated that attention alone, in the form of self-attention computed with “scaled dot-product attention,” can effectively capture dependencies between words in a sequence and generate accurate translations.

One of the key advantages of the Transformer model is its ability to capture long-range dependencies in the input sequence more effectively than traditional recurrent neural networks. This is achieved through the use of self-attention, which allows the model to attend to different parts of the input sequence at different positions. By attending to relevant words in the input sequence, the Transformer model can generate more coherent and accurate translations.
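Scaled dot-product attention, as defined in the paper, is softmax(QKᵀ / √d_k) V. The NumPy sketch below applies it as self-attention to a short sequence. The sequence length, model size and random projection matrices are assumptions for illustration, and it shows a single attention head without masking or the rest of the Transformer block.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of every position with every other
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))      # one vector per word in the sequence

# In self-attention, queries, keys and values are all projections of the same input
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

All positions are handled in a single matrix multiplication rather than in a step-by-step loop, which is where the parallelism comes from.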

Limitations + Race between Google & OpenAI

The biggest challenge in training a Transformer is the requirement for a huge amount of data. It does not perform well on small or medium-sized datasets. Only companies like MAANG (Meta, Apple, Amazon, Netflix, Google) have access to such large amounts of data to train a Transformer and achieve exceptional results.

Both Google and OpenAI recognized this as a significant opportunity and started working on developing their own transformers. In 2018, they each successfully developed their own transformers using the enormous amount of data they had access to.

Google’s transformer is called BERT, while OpenAI’s transformer is called GPT. The main architectural difference between the two is that BERT is an encoder-only language model, while GPT is a decoder-only language model.

Since not everyone has access to such large datasets, most researchers and individuals were unable to train their own Transformers. Everyone knew that, in theory, the architecture from “Attention is All You Need” would yield astonishing results, but only companies like MAANG (Meta, Apple, Amazon, Netflix, Google) were able to take advantage of it. In 2018, new research brought Transfer Learning to Natural Language Processing and revolutionized research in the field. This leads us to the next part of the movie: Transfer Learning.

6. Transfer Learning

[4] In 2018, research showed that applying transfer learning to Natural Language Processing (NLP) yields very good results. Prior to this, transfer learning was used mainly with Convolutional Neural Networks for images, and rarely in NLP.

So, let’s first understand what transfer learning is.

Transfer learning is a machine learning technique that leverages knowledge gained from one task to improve performance on another related task. Instead of training a model from scratch for each specific task, transfer learning allows us to use pre-trained models that have already been trained on large amounts of data for a related task.

In transfer learning, the pre-trained model serves as a feature extractor, capturing general patterns and representations from the data. These learned features can then be used as input for a new model that is trained specifically for the target task. By using the pre-trained model as a starting point, the new model can benefit from the knowledge and insights gained from the previous task, even if the datasets are different.

Example -

For instance, if you have a dataset of images of cats and dogs, but you don’t have a large enough dataset to train a deep neural network from scratch, you can take a pre-trained model that has been trained on a large dataset, like ImageNet, and use it as a feature extractor. You can remove the last few layers of the pre-trained model and add new layers that are specific to your task, such as a classifier for classifying cats and dogs. Then, you can train this modified model on your smaller dataset, leveraging the knowledge and representations learned from the ImageNet dataset.

By using transfer learning, you can benefit from the pre-trained model’s ability to extract meaningful features from images and improve the performance of your model, even with limited training data. It saves training time and computational resources while still achieving good results.
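The freeze-and-retrain pattern from the cats-and-dogs example can be sketched in a few lines of NumPy. This is a toy stand-in rather than a real image model: `W_pretrained` is a random matrix playing the role of weights loaded from a large pre-trained network, the small dataset is synthetic, and the new "head" is a simple logistic-regression classifier. Only the head is trained; the feature extractor is never updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained network. In practice these weights would be loaded
# from a model trained on a large dataset; here they are random (hypothetical).
W_pretrained = rng.normal(scale=0.5, size=(16, 4))

def extract_features(x):
    """Frozen feature extractor: its weights are never updated below."""
    return np.maximum(0.0, x @ W_pretrained.T)

# Small labelled dataset for the new task (synthetic, for illustration only)
X = rng.normal(size=(40, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

features = extract_features(X)            # computed once, since the extractor is frozen
w_head = np.zeros(features.shape[1])      # new task-specific layer, trained from scratch
b_head = 0.0

def loss_and_grads(w, b):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))     # predicted probability of class 1
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss, features.T @ (p - y) / len(y), np.mean(p - y)

initial_loss, _, _ = loss_and_grads(w_head, b_head)
for _ in range(200):                      # plain gradient descent on the head only
    _, grad_w, grad_b = loss_and_grads(w_head, b_head)
    w_head = w_head - 0.1 * grad_w
    b_head = b_head - 0.1 * grad_b
final_loss, _, _ = loss_and_grads(w_head, b_head)
```

With a real pre-trained model the structure is the same: load the weights, keep them fixed, replace the last layers, and train only the new layers on the smaller dataset.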

Advancement in Natural Language Processing: A Game-Changing Era

The advancements in Natural Language Processing (NLP) after the introduction of transfer learning have been game-changing. With pre-trained models like BERT and GPT, researchers and developers can leverage transfer learning on these 2 transformers to improve the performance of their own models. This has led to significant progress in tasks such as machine translation, chatbot development, speech recognition, and text generation. The graph of natural language processing advancement has soared since the advent of transfer learning.

Let’s move on to the final part of the movie: LLMs.

7. LLMs

OpenAI continued to work on their Large Language Models (LLMs) and developed GPT-2 and later GPT-3. Meanwhile, Google also made progress with their LLM, creating Bard, although OpenAI’s version proved to be superior. Additionally, Elon Musk is working on his own language model called Grok.

If you enjoyed the content, please give it a like. Each like acts as a dopamine rush for me.

8. References

I watched this video, which helped me write this document. — https://www.youtube.com/watch?v=8fX3rOjTloc

[1] Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.

[2] Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30.

[4] Howard, J. and Ruder, S., 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

A one stop destination for Recommendation Systems

Kavach Dheer — Tue, 21 Nov 2023 20:44:03 GMT

Your one-stop destination for everything you need to know about recommendation systems, from the basics to an in-depth look at three major types of recommendation system.

Let’s start with what a recommendation system is and why we need one.

Every day, we are inundated with choices and options. What to wear? What movie to rent? What stock to buy? What blog post to read? The sizes of these decision domains are frequently massive: Netflix has over 17,000 movies in its selection, and Amazon has over 410,000 titles in its Kindle store alone.

We need a system that helps us make decisions across this massive range of options and choices, and this is exactly where a recommendation system comes into play.

Recommender Systems are the set of tools and techniques to provide useful recommendations and suggestions to the users to help them in the decision-making process for choosing the right products or services and giving them a good user experience.

A recommendation system tailors a personalised and unique place for each user.

For example, [1] Amazon shows programming titles to a software engineer and baby toys to a new mother.

A recommendation system learns what the user needs and suggests those things.

Data Collection

A recommendation system needs data to work, and there are 2 major ways to collect it:

Explicit Feedback: Asking users directly through the user interface by giving them options.

Example: When we create a new Spotify account, it asks what genres of music we like, which artists we like, our favourite band, and so on.

However, it often requires users to actively provide this information, and not all users may be willing to do so. Additionally, collecting explicit feedback can be resource-intensive and may impact the user experience.

Implicit Feedback: Observing the user. [3] If a user purchases an item, that is a sign that the user likes the item, while if the user purchases and then returns the item, that is a sign that the user doesn’t like it. Implicit feedback is more common and easier to collect because it doesn’t require users to explicitly rate or review items. It is often used when explicit feedback is scarce or unavailable.

In this article we will discuss the different types of recommendation system, the algorithms used in them, and the merits and demerits of each.

Content-based

Collaborative recommendation

Context-based Recommendation System

Knowledge Based Recommendation System

Knowledge Graph Recommendation System

ChatGPT-based Recommendation System

Let’s understand the first one.

Content Based Recommendation system

[2]Content-based recommendation systems try to recommend items similar to those a given user has liked in the past.

This type of system first needs data about the user. It collects that data through explicit or implicit feedback, stores it in a database, and builds up a profile of the user.

Let’s illustrate how a content-based recommendation system works with a practical example in the context of movie recommendations:

Step 1: Item Representation: Each movie in the system is represented by various features, such as genre, director, actors, and user ratings. For instance:

Movie A: Action, Sci-Fi, Directed by Christopher Nolan, Starring Leonardo DiCaprio, User Rating 4.5/5

Movie B: Drama, Romance, Directed by Greta Gerwig, Starring Saoirse Ronan, User Rating 4.0/5

Movie C: Action, Adventure, Directed by Steven Spielberg, Starring Harrison Ford, User Rating 4.2/5

Step 2: User Profile Creation: The system builds a user profile based on the movies the user has interacted with. Let’s say the user has previously liked and watched:

Movie A: Action, Sci-Fi, Directed by Christopher Nolan, Starring Leonardo DiCaprio

The user’s profile might look like this:

Preferences: Action (high weight), Sci-Fi (medium weight), Christopher Nolan (high weight), Leonardo DiCaprio (medium weight)

Step 3: Similarity Calculation: The system calculates the similarity between the user’s profile and other movies in the database. For example, it might find that Movie D shares many features with the user’s profile:

Movie D: Action, Sci-Fi, Directed by Christopher Nolan, Starring Tom Hardy, User Rating 4.3/5

The similarity calculation could result in a high similarity score between the user’s profile and Movie D because of the shared attributes with Movie A.

Step 4: Ranking and Filtering: Based on similarity scores, the system ranks the movies in the database. Movie D, being highly similar to the user’s profile, is ranked at the top.

Step 5: Recommendation Generation: The system presents Movie D as a recommendation to the user, as it’s the most relevant to their preferences. The user is more likely to be interested in watching Movie D because it aligns with their past interactions and preferences.

Step 6: Feedback Loop: If the user watches Movie D and provides feedback (e.g., by rating it or indicating whether they liked it), this feedback is incorporated into their profile. Over time, the user’s profile evolves to reflect their changing preferences.

This example simplifies the content-based filtering process. In practice, large datasets and more complex algorithms are used to provide accurate recommendations.
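The similarity and ranking steps above can be sketched in plain Python using the movies from the example. Two simplifications are assumed here: each movie is reduced to a set of genre, director and actor features (user ratings are left out), and the user profile is just the features of Movie A with equal weights rather than the high/medium weights mentioned in Step 2.

```python
import math

# Candidate movies from the walkthrough, each reduced to a set of features
movies = {
    "Movie B": {"Drama", "Romance", "Greta Gerwig", "Saoirse Ronan"},
    "Movie C": {"Action", "Adventure", "Steven Spielberg", "Harrison Ford"},
    "Movie D": {"Action", "Sci-Fi", "Christopher Nolan", "Tom Hardy"},
}

# Simplified profile: the features of the one movie the user liked (Movie A)
user_profile = {"Action", "Sci-Fi", "Christopher Nolan", "Leonardo DiCaprio"}

def cosine_similarity(a, b):
    """Cosine similarity between two feature sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

scores = {title: cosine_similarity(user_profile, feats) for title, feats in movies.items()}
ranking = sorted(scores, key=scores.get, reverse=True)   # most similar first
```

Movie D shares three of its four features with the profile, so it ends up at the top of `ranking`, matching Step 4 of the walkthrough.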

Advantages

Personalization: Content-based recommendation systems can provide highly personalized recommendations because they focus on the specific preferences and characteristics of individual users. This personalization can enhance user satisfaction and engagement.
Domain-specific Recommendations: Content-based methods work well when there are domain-specific attributes or features that are important for recommendations. For example, in music recommendation, genre and artist preferences can be crucial.

Disadvantages

Limited Serendipity: Content-based recommendation systems may struggle to introduce users to new and unexpected items because recommendations are based on the user’s existing preferences and item features.

Example: A user may not follow football during the season but then become interested in the Super Bowl.

Collaborative Recommendation system

This type of system recommends items based on similar users. For instance, when the recommendation system wants to suggest content to User A, it examines profiles to identify users with similar preferences. If it finds that User B has a similar profile to User A, the system recommends items to User A based on User B’s preferences and likes as well.

Let’s illustrate how a collaborative recommendation system works with a practical example.

Now, let’s say we want to recommend movies to Alice, who has already rated Movie A, Movie C, and Movie E. We’ll use user-based collaborative filtering to make these recommendations.

Step 1: User Similarity Calculation (Cosine Similarity):

We calculate the similarity between Alice and the other users based on their movie ratings.

Similarity(Alice, Bob) = 0.28
Similarity(Alice, Carol) = 0.95
Similarity(Alice, Dave) = 0.27

Step 2: Neighbourhood Selection:

We select a subset of users with the highest similarity to Alice. Let’s choose a similarity threshold of 0.5, so we select Carol (similarity 0.95).

Step 3: Rating Prediction:

Now, we predict Alice’s rating for movies she hasn’t seen (Movie B and Movie D) based on the ratings of users in her neighbourhood (in this case, just Carol).

Predicted Rating(Alice, Movie B) = (Similarity(Alice, Carol) * Rating(Carol, Movie B)) / Similarity Sum = (0.95 * 3) / 0.95 = 3
Predicted Rating(Alice, Movie D) = (0.95 * 4) / 0.95 = 4

Step 4: Top-N Recommendations:

Now, we recommend the top-N movies with the highest predicted ratings to Alice. Let’s say we recommend the top 2.

Top 2 Recommendations for Alice: Movie D and Movie B

So, based on user-based collaborative filtering, Alice should watch Movie D and Movie B because users with similar tastes (in this case, only Carol) liked these movies, and they are predicted to be good matches for Alice’s preferences.

This example simplifies the collaborative filtering process. In practice, large datasets and more complex algorithms are used to provide accurate recommendations.
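The neighbourhood selection and rating prediction steps can be sketched as follows, using the similarity scores and Carol's ratings from the walkthrough. The function name `predict_rating` and the dictionary layout are assumptions made for illustration; Bob's and Dave's ratings are omitted because they fall below the similarity threshold and never enter the calculation.

```python
# Similarities to Alice, as given in Step 1
similarity_to_alice = {"Bob": 0.28, "Carol": 0.95, "Dave": 0.27}

# Ratings by potential neighbours for the movies Alice has not seen
ratings = {"Carol": {"Movie B": 3, "Movie D": 4}}

def predict_rating(movie, similarities, ratings, threshold=0.5):
    """Similarity-weighted average of the neighbours' ratings for one movie."""
    neighbours = [
        (user, sim) for user, sim in similarities.items()
        if sim >= threshold and movie in ratings.get(user, {})
    ]
    if not neighbours:
        return None   # nobody similar enough has rated this movie
    weighted_sum = sum(sim * ratings[user][movie] for user, sim in neighbours)
    return weighted_sum / sum(sim for _, sim in neighbours)

predictions = {
    movie: predict_rating(movie, similarity_to_alice, ratings)
    for movie in ["Movie B", "Movie D"]
}
top_n = sorted(predictions, key=predictions.get, reverse=True)[:2]
```

With a single neighbour the weighted average collapses to Carol's own ratings (3 and 4), which is why Movie D is ranked ahead of Movie B. With several neighbours above the threshold, each would contribute in proportion to their similarity.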

There are two major types of algorithms used in collaborative filtering (CF):

Memory-based algorithms: These are heuristic-based algorithms that try to predict the target user’s rating for an item based on partial information available about the target user and normalized weights obtained from the dataset. Commonly used techniques in memory-based algorithms are the Pearson correlation coefficient and vector similarity techniques. Some advanced techniques in memory-based algorithms include default voting, inverse user frequency, case amplification and imputation-boosted CF algorithms.
Model-based algorithms: These are machine learning models that try to recognize patterns in the datasets available for CF. Commonly used methods in this category include Bayesian networks, clustering models, regression models, latent semantic models, etc.
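As an illustration of the memory-based approach, here is a small sketch of the Pearson correlation coefficient between two users, computed over the items both have rated. The three users and their ratings are hypothetical and chosen so the result is easy to check by hand.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation between two users over the items both have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0                       # not enough overlap to measure agreement
    a = [ratings_a[item] for item in common]
    b = [ratings_b[item] for item in common]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0                       # a user who rates everything the same
    return cov / math.sqrt(var_a * var_b)

# Hypothetical ratings on a 1-5 scale
user_1 = {"Movie A": 5, "Movie C": 3, "Movie E": 1}
user_2 = {"Movie A": 4, "Movie C": 3, "Movie E": 2}   # same taste, milder scores
user_3 = {"Movie A": 1, "Movie C": 3, "Movie E": 5}   # opposite taste

same_taste = pearson(user_1, user_2)
opposite_taste = pearson(user_1, user_3)
```

Because Pearson correlation subtracts each user's mean rating, users 1 and 2 come out as perfectly correlated even though user 2 rates less extremely, while users 1 and 3 come out as perfectly anti-correlated.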

Advantages

No Need for Item Metadata: Unlike content-based recommendation systems that rely on item attributes, user-based collaborative filtering doesn’t require detailed information about items. It works solely based on user behavior, making it applicable to a wide range of item types.
Serendipity: It can introduce users to new and unexpected items that they might not have discovered on their own. This serendipity can enhance the user experience by exposing users to diverse content.

Disadvantages

Sparsity: Collaborative filtering can suffer from sparsity issues when dealing with large datasets, as most users have only interacted with a small fraction of available items. Techniques like matrix factorization and dimensionality reduction can help mitigate this issue.
Cold Start Problem: Collaborative filtering struggles to provide recommendations for new users or items with no interaction history. Hybrid recommendation systems that combine collaborative filtering with content-based or other approaches can address this problem.

Context Recommendation System

R : User × Item × Context → Rating

A context-based recommendation system takes the context into account when it recommends something, rather than relying on users and items alone. Context can include the time, location, temporal trends, device and platform, weather conditions, and so on.

Let’s illustrate how a context-based recommendation system works with a practical example

Consider the application for recommending movies to users, where users and movies are described as relations having the following attributes:

Movie: the set of all the movies that can be recommended; it is defined as Movie(MovieID, Title, Length, ReleaseYear, Director, Genre).

User: the people to whom movies are recommended; it is defined as User(UserID, Name, Address, Age, Gender, Profession).

Further, the contextual information consists of the following three types that are also defined as relations having the following attributes:

Theater: the movie theaters showing the movies; it is defined as Theater(TheaterID, Name, Address, Capacity, City, State, Country).

Time: the time when the movie can be or has been seen; it is defined as Time(Date, DayOfWeek, TimeOfWeek, Month, Quarter, Year). Here, attribute DayOfWeek has values Mon, Tue, Wed, Thu, Fri, Sat, Sun, and attribute TimeOfWeek has values “Weekday” and “Weekend”.

Companion: represents a person or a group of persons with whom one can see a movie. It is defined as Companion(companionType), where attribute companionType has values “alone”, “friends”, “girlfriend/boyfriend”, “family”, “co-workers”, and “others”.

Then the rating assigned to a movie by a person also depends on where and how the movie has been seen, with whom, and at what time. For example:

The type of movie to recommend to college student Jane Doe can differ significantly depending on whether she is planning to see it on a Saturday night with her boyfriend vs. on a weekday with her parents.
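One simple way to picture the rating function R : User × Item × Context → Rating is as a lookup keyed by all three components. The sketch below is an assumption-heavy illustration: the movie titles and rating values are made up, the context is reduced to the TimeOfWeek and companionType attributes, and a real system would predict ratings for unseen combinations rather than look them up.

```python
from typing import NamedTuple, Optional

class Context(NamedTuple):
    time_of_week: str    # "Weekday" or "Weekend"
    companion: str       # e.g. "alone", "friends", "girlfriend/boyfriend", "family"

# Hypothetical ratings keyed by (user, movie, context)
known_ratings = {
    ("Jane Doe", "Romantic Comedy X", Context("Weekend", "girlfriend/boyfriend")): 5,
    ("Jane Doe", "Family Animation Y", Context("Weekend", "girlfriend/boyfriend")): 2,
    ("Jane Doe", "Romantic Comedy X", Context("Weekday", "family")): 2,
    ("Jane Doe", "Family Animation Y", Context("Weekday", "family")): 4,
}

def recommend(user: str, context: Context, ratings: dict) -> Optional[str]:
    """Return the highest-rated movie for this user in this exact context."""
    candidates = {
        movie: rating
        for (u, movie, c), rating in ratings.items()
        if u == user and c == context
    }
    return max(candidates, key=candidates.get) if candidates else None

date_night = recommend("Jane Doe", Context("Weekend", "girlfriend/boyfriend"), known_ratings)
family_night = recommend("Jane Doe", Context("Weekday", "family"), known_ratings)
```

The same user gets a different top recommendation in each context, which a plain User × Item rating table could not express.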