My Interestship 5.0 NLP Journey 🛣️

Prerna Mittal · Published in Clique Community · 9 min read · Aug 6, 2023

Hello fellow NLP enthusiasts! I was thrilled to be selected for Clique’s prestigious event, “Interestship 5.0,” as an NLP mentee to work on the project — “Quora Insincere Questions Classification using BERT”. The journey has been a rollercoaster of learning and growth, and I’m excited to share the technical intricacies and personal experiences that have shaped this remarkable journey.

Week 1: Embarking on the NLP Voyage

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI). It enables machines to process and understand human language so that routine language tasks can be automated.

The journey commenced with an introduction to Natural Language Processing (NLP) and its real-life applications. Our mentors provided us with a solid foundation by covering the basics of NLP, including key terminologies and fundamental techniques such as stop-word removal, normalisation, stemming, tokenisation, and lemmatisation. To ensure a hands-on experience, we actively coded in Jupyter notebooks using the NLTK library, gaining practical insights into the nuances of text processing.

Tokenization: Tokenization is the process of breaking down a sentence or text into individual words or subwords, known as tokens. These tokens serve as the basic building blocks for NLP models to process text data.

Stop-word removal: Stop words are common words in a language (e.g., “the,” “is,” “and”) that occur frequently but carry little or no semantic meaning. During stop-word removal, these words are filtered out from the text to reduce noise and improve the efficiency of text processing and analysis.

Normalization: Normalization is the process of transforming text data to a common, standardized format. It involves converting all text to lowercase, removing punctuation, and handling variations like converting “USA” and “U.S.A” to “usa” to ensure uniformity in the data.

Stemming: Stemming is the process of reducing words to their root or base form by chopping off suffixes or prefixes. For example, “running” and “runs” are both stemmed to “run.” Stemming helps reduce word variations by collapsing similar words to the same root, simplifying text analysis.

Lemmatization: Lemmatization is a more advanced process than stemming. It reduces words to their base or dictionary form, known as the lemma. Unlike stemming, lemmatization considers the word’s meaning and part of speech, producing more accurate results. For example, “ran” is lemmatized to “run,” and “better” (as an adjective) to “good.”
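To make these steps concrete, here is a minimal sketch of the Week 1 pipeline using NLTK; the sample sentence and variable names are illustrative, not taken from the project notebook.

```python
# A minimal sketch of tokenization, stop-word removal, normalization,
# stemming, and lemmatization with NLTK (illustrative example).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The runners were running faster than usual."

# Normalization: lowercase before further processing
normalized = text.lower()

# Tokenization: split the sentence into individual tokens
tokens = word_tokenize(normalized)

# Stop-word removal: drop common words that carry little meaning
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming: chop off suffixes to reach a crude root form
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in filtered])

# Lemmatization: map each word to its dictionary form (lemma)
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos="v") for t in filtered])
```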

Week 2: Unveiling Transformers and BERT

General Architecture of Transformer Model

Transformers: Transformers are a type of deep learning model that revolutionized NLP tasks. They use self-attention mechanisms to analyze the relationship between different words in a sentence, allowing them to process input sequences in parallel and capture long-range dependencies effectively.

BERT (Bidirectional Encoder Representations from Transformers): BERT is a pre-trained transformer-based model developed by Google. It is bidirectional and captures context from both the left and right sides of a word, making it highly effective for various NLP tasks, including text classification and question answering.

As we delved deeper into the second week, we explored the concept of transformers and their significance in NLP. Concepts like attention mechanisms, transfer learning, pre-training, and fine-tuning were introduced, paving the way for the BERT (Bidirectional Encoder Representations from Transformers) model. We worked with the transformers and timm libraries and implemented BERT in a Jupyter notebook.
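For reference, a minimal sketch of loading a pre-trained BERT checkpoint with the Hugging Face transformers library looks like this; the model name and sample sentence are illustrative.

```python
# Loading a pre-trained BERT model and tokenizer with the transformers library
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Why do people ask insincere questions?", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional contextual vector per token, including [CLS] and [SEP]
print(outputs.last_hidden_state.shape)
```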

Attention Mechanisms: Attention mechanisms are components in transformer-based models that allow the model to focus on important parts of the input sequence by assigning varying weights to different tokens.
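The idea can be sketched in a few lines of PyTorch: scaled dot-product attention turns query/key similarities into weights over the value vectors. The tensor shapes below are made up purely for illustration.

```python
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Similarity of every token (query) with every other token (key)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Softmax turns similarities into attention weights that sum to 1
    weights = F.softmax(scores, dim=-1)
    # Each output vector is a weighted sum of the value vectors
    return weights @ v, weights

q = k = v = torch.randn(1, 5, 64)  # 1 sentence, 5 tokens, 64-dim vectors
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)    # (1, 5, 64) and (1, 5, 5)
```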

Transfer Learning: Transfer learning is a technique where a model pre-trained on a large dataset is fine-tuned on a specific task, leveraging the knowledge gained during pre-training to improve performance.

Pre-training: Pre-training is the initial phase of transfer learning, where a language model is trained on a diverse dataset to learn general language representations without any specific task in mind.

Fine-tuning: Fine-tuning is the subsequent phase of transfer learning, where the pre-trained language model is further trained on a task-specific dataset to adapt its learned representations to the target task.
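Putting pre-training and fine-tuning together, a rough sketch of one fine-tuning step on our binary task might look like the following; the hyperparameters and sample batch are assumptions, not the project’s actual settings.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Start from the pre-trained encoder and add a fresh 2-class classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["Is this question sincere?"], padding=True, return_tensors="pt")
labels = torch.tensor([0])  # 0 = sincere, 1 = insincere (illustrative)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # the loss is computed internally
outputs.loss.backward()                  # one fine-tuning step
optimizer.step()
```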

Week 3: Data Preprocessing and the Quora Insincere Questions Dataset

Data preprocessing involves cleaning and transforming raw data to prepare it for analysis.

Week 3 was an exciting juncture: tackling the task of identifying insincere questions on Quora. Armed with the techniques learnt in the previous weeks, we delved into data preprocessing, ensuring the data was primed and ready for analysis. Implementing the text processing techniques in a Jupyter notebook further solidified our understanding.

The Quora NLP dataset was assigned for Task 1, which required us to generate graphs and draw meaningful conclusions from the data. The following day, we presented our findings during an insightful session on Exploratory Data Analysis (EDA) and TF-IDF.

Exploratory Data Analysis (EDA): EDA is the process of visually and statistically analyzing data to gain insights, identify patterns, and understand the structure and characteristics of the dataset.

We generated several plots from the dataset:

  • Target distribution: the x-axis shows the target values (0 for sincere, 1 for insincere) and the y-axis the count of each, revealing how balanced (or imbalanced) the two classes are.
  • Top-N word frequencies: the most frequent words appear on the x-axis with their counts on the y-axis.
  • Word cloud: a visual of the most frequent words in the dataset, with each word sized according to its frequency.
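A rough sketch of the code behind these plots is below; it assumes the Quora training data has been loaded into a pandas DataFrame with question_text and target columns (the file name is an assumption).

```python
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("train.csv")  # assumed path to the Quora training data

# Target distribution: 0 = sincere, 1 = insincere
df["target"].value_counts().plot(kind="bar")
plt.xlabel("target")
plt.ylabel("count")
plt.title("Sincere vs. insincere questions")
plt.show()

# Top 20 most frequent words across all questions
word_counts = Counter(" ".join(df["question_text"]).lower().split())
pd.Series(dict(word_counts.most_common(20))).plot(kind="bar")
plt.ylabel("count")
plt.title("Top 20 words")
plt.show()
```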

TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a technique used to represent the importance of each word in a document relative to a collection of documents. It quantifies how frequently a word appears in a document (term frequency) and inversely weighs it based on its frequency in the entire document collection.
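A small scikit-learn sketch shows the idea; the toy corpus here is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Why is the sky blue?",
    "Why do people ask insincere questions?",
    "How do transformers work in NLP?",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)       # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))                # TF-IDF weight of each word per document
```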

Week 4: Model Evaluation and Advanced Visualization

Model evaluation is the process of assessing the performance of a trained machine learning model using various metrics to ensure its effectiveness in solving the given problem.

The focus of Week 4 was model evaluation and advanced visualization. We delved into box plots and violin plots, understanding their utility and mathematical foundations. In our project, we implemented the TF-IDF Vectorizer, Logistic Regression, Multinomial Naive Bayes, the confusion matrix, and Stratified K-Fold Cross Validation for handling the imbalanced dataset. Realizing the significance of the F1 score as a robust metric for model evaluation, we refined our approach accordingly.
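As a quick illustration, the box and violin plots can be drawn with seaborn, here comparing question length across the two classes; this assumes the same df DataFrame as in the EDA sketch.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Question length as a simple numeric feature to visualize per class
df["num_words"] = df["question_text"].str.split().str.len()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x="target", y="num_words", data=df, ax=axes[0])     # quartiles and outliers
sns.violinplot(x="target", y="num_words", data=df, ax=axes[1])  # full distribution shape
plt.show()
```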

Stratified K-Fold Cross Validation: Stratified K-Fold Cross Validation is a technique used to evaluate model performance while maintaining class distribution balance. It divides the dataset into K subsets, ensuring that each subset contains a proportional representation of each class.

F1 Score: The F1 score is a metric that combines precision and recall to measure a model’s performance. It is especially useful when dealing with imbalanced datasets, as it considers false positives and false negatives equally.

Logistic Regression: Logistic Regression is a supervised learning algorithm used for binary classification tasks. It predicts the probability of an instance belonging to a specific class based on input features.

Confusion Matrix: A confusion matrix is a table used to evaluate the performance of a classification model. It shows the number of correct and incorrect predictions made by the model, along with true positive, true negative, false positive, and false negative values.
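A condensed sketch of this Week 4 pipeline, under the same DataFrame assumption as before, could look like the following; the fold count and hyperparameters are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

X_text, y = df["question_text"], df["target"].values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(X_text, y)):
    # Fit TF-IDF on the training fold only, then transform the validation fold
    vectorizer = TfidfVectorizer(max_features=50_000)
    X_train = vectorizer.fit_transform(X_text.iloc[train_idx])
    X_val = vectorizer.transform(X_text.iloc[val_idx])

    clf = LogisticRegression(max_iter=1000)  # or MultinomialNB() for comparison
    clf.fit(X_train, y[train_idx])
    preds = clf.predict(X_val)

    print(f"Fold {fold}: F1 = {f1_score(y[val_idx], preds):.3f}")
    print(confusion_matrix(y[val_idx], preds))
```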

Week 5: Building the Final Project with BERT

In week 5, our attention shifted to the implementation of the BERT model. We explored BERT input with CLS (Classification) and SEP (Separation) tokens, along with the BERT tokenizer and embeddings.

CLS (Classification) Token: The CLS token is added at the beginning of the input sequence. It is used to represent the overall classification of the entire sequence. In tasks like text classification, sentiment analysis, and question answering, the final hidden state corresponding to the CLS token is often used as the aggregate representation of the input sequence, which is then fed to a classification layer for the specific task.

SEP (Separation) Token: The SEP token is used to separate different segments or sentences within the input sequence. In tasks that require multiple sentences or pairs of sequences (e.g., question answering, natural language inference), the SEP token is used to delimit the boundaries between segments, allowing the model to understand the relationships between them.

BERT Tokenization: The BERT tokenizer is used to tokenize the input text and convert it into BERT-specific input features (input_ids and attention_mask).
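A minimal sketch of this tokenization step, showing the [CLS] and [SEP] tokens alongside the input_ids and attention_mask features (the sample question is illustrative):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Why do people ask insincere questions?", return_tensors="pt")

# The special tokens bracket the sequence: ['[CLS]', 'why', 'do', ..., '?', '[SEP]']
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
print(encoded["attention_mask"])  # 1 for real tokens, 0 for any padding
```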

Custom Dataset: A custom dataset class is created to organize the tokenized data and targets into PyTorch tensors.

Data Preparation and Loading: The training, validation, and test data are prepared and loaded into PyTorch DataLoaders to efficiently handle batching.
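A sketch of how the custom dataset and DataLoaders might be wired together; the class name, column handling, and batch size are illustrative rather than the exact notebook code.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class QuoraDataset(Dataset):
    def __init__(self, texts, targets, tokenizer, max_len=64):
        self.texts, self.targets = texts, targets
        self.tokenizer, self.max_len = tokenizer, max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize one question and pad/truncate it to a fixed length
        enc = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "target": torch.tensor(self.targets[idx], dtype=torch.long),
        }

# train_texts, train_targets, and tokenizer are assumed from the earlier steps
train_loader = DataLoader(
    QuoraDataset(train_texts, train_targets, tokenizer),
    batch_size=32,
    shuffle=True,
)
```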

Predictions and Metrics Calculation:

The trained BERT model is used to make predictions on the test set:

  • Predictions: The model predicts the target labels for the test set, and corresponding question texts are extracted.
  • Classification Report: The classification_report() function is used to compute precision, recall, F1-score, and other classification metrics, providing a comprehensive evaluation of the model’s performance.
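A condensed sketch of this evaluation step, assuming the fine-tuned model, a test_loader built like the DataLoader above, and a device variable from earlier in the notebook:

```python
import torch
from sklearn.metrics import classification_report

model.eval()
all_preds, all_targets = [], []

with torch.no_grad():
    for batch in test_loader:
        logits = model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
        ).logits
        all_preds.extend(logits.argmax(dim=1).cpu().tolist())
        all_targets.extend(batch["target"].tolist())

# Precision, recall, F1-score, and support for each class
print(classification_report(all_targets, all_preds, digits=4))
```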

  • Class 0 (sincere): F1-score of 97.14% (14,934 instances).
  • Class 1 (insincere): F1-score of 37.62% (976 instances).
  • Overall accuracy: 94.54% across both classes.

These results suggest that the model distinguishes Class 0 instances well but struggles with Class 1, as reflected in the lower precision and recall for the minority class.

Check out my source code for this project along with the demo video here:

This project showcases the complete workflow of text data processing, exploration, BERT model implementation, training, evaluation, and metrics calculation for text classification tasks.

Mentorship: The Key to Success

Our mentor, the remarkable Pallavi Pannu, guided us throughout the journey. Her insightful explanations and unwavering support made the most complex concepts appear approachable. She generously addressed even the smallest of doubts, ensuring that we mastered the code and grasped the core concepts of NLP.

The essence of Interestship 5.0, which revolves around learning through hands-on projects and mentorship from industry professionals, proved to be a game-changer in my NLP exploration.

Project Fair: Showcasing our Triumphs

The wonderful Clique Interestship team!

The much-anticipated Project Fair arrived on 30th July, when each team, including ours, had the opportunity to present their projects. With enthusiasm and pride, we showcased “Quora Insincere Questions Classification using BERT.” We presented the project’s introduction, workflow, methodology, results, and future scope, and passionately demonstrated our hard work using a Google Colab notebook, demo videos, graphs, flowcharts, and other visuals.

Star Cliquer: A Proud Moment of Recognition

To my sheer delight, my dedication and commitment to the NLP team earned me the ‘Star Cliquer’ title. The recognition filled my heart with joy, and I was rewarded with exciting goodies as a token of appreciation for my approach to and passion for the project.

goodies!!

As I continue to tread on this exciting path, I eagerly look forward to applying my newfound knowledge in future projects and research endeavours. Thank you for joining me in my remarkable Interestship 5.0 NLP journey!
