Harnessing the Power of BERT: Building an NLP Model to Predict Sincerity of Quora Questions

Anubha Sharma · Clique Community · Jul 29, 2023

I count myself fortunate to have been selected as a participant in Interestship 5.0, a remarkable program organized by Clique Community in collaboration with IEEE Gujarat Section.

During this enriching experience, I had the privilege to explore the fascinating realm of Natural Language Processing (NLP) and be part of a captivating project: building an NLP model with BERT to predict whether questions asked on Quora are sincere or insincere.

Unraveling the Foundations of NLP

As part of the captivating Interestship 5.0, we embarked on an immersive journey, thoroughly exploring the profound world of Natural Language Processing (NLP).

What is NLP?

NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.

Image Source: https://www.freepik.com/free-photos-vectors/nlp

Examples of NLP are:

  1. Speech Recognition
  2. Document Summarization
  3. Machine Translation

Streamlining Data with Text Preprocessing

First, we thoroughly explored two vital components: understanding the dataset and executing text preprocessing techniques.

Understanding the data was a pivotal initial step, enabling us to grasp its characteristics and potential challenges.

We then proceeded to streamline the dataset through text preprocessing, focusing on five essential steps (a short code sketch tying them together follows the list):

Step 1 - Punctuation Removal: We eliminated punctuation marks, which add noise without contributing to the text's semantic content.

Step 2 - Lowercasing: Words like "Book" and "book" mean the same thing, but if the text is not converted to lowercase they are treated as two different words in the vector space model, adding unnecessary dimensions.

Step 3 - Tokenization: Tokenization splits a string of text into a list of tokens. We used NLTK, the Natural Language Toolkit, one of the best-known and most widely used NLP libraries, which supports tokenization, stemming, tagging, parsing, and more.

Image Source: https://medium.com/@ajay_khanna/tokenization-techniques-in-natural-language-processing-67bb22088c75

Step 4 - Stop Words Removal: Stop words are very common words (a, an, the, etc.). They carry little discriminative information, since they do not help distinguish one document from another.

Step 5 - Lemmatization: Lemmatization reduces each word to its base or dictionary form (its lemma), so that, for example, "books" becomes "book".

Image Source: https://www.quora.com/What-is-tokenization-and-lemmatization-in-NLP
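To make these steps concrete, here is a minimal sketch of the preprocessing pipeline using NLTK. The function and variable names are illustrative, not the exact code from the project:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Step 1: remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Step 2: lowercase so "Book" and "book" map to the same token
    text = text.lower()
    # Step 3: tokenize into a list of words
    tokens = word_tokenize(text)
    # Step 4: drop stop words such as "a", "an", "the"
    tokens = [t for t in tokens if t not in stop_words]
    # Step 5: lemmatize each token to its dictionary form
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("Why are the books on NLP so interesting?"))
# Expected output (approximately): ['book', 'nlp', 'interesting']
```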

Unveiling Insights through Exploratory Data Analysis

After text preprocessing, we dived into Exploratory Data Analysis (EDA), built baseline models with Logistic Regression and Naive Bayes, evaluated them with a confusion matrix, and harnessed the power of TF-IDF for feature extraction (a brief code sketch of this baseline appears after the list).

  1. Exploratory Data Analysis (EDA): Uncovering dataset patterns and characteristics.
  2. Logistic Regression: Supervised learning for binary classification.
  3. Naive Bayes: Probabilistic algorithm for text classification.
  4. Confusion Matrix: Evaluating model predictions.
  5. TF-IDF (Term Frequency-Inverse Document Frequency): Extracting important features based on word importance in documents.
The class distribution plot shows 93.87% sincere and 6.19% insincere questions in the dataset.
Top 10 sincere and insincere words in the dataset.
Word cloud: a collection, or cluster, of words depicted in different sizes.
Confusion matrix.
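Below is a hedged sketch of this baseline pipeline with scikit-learn. The file name, column names, and hyperparameters are assumptions for illustration and are not necessarily those used in the project:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Assumed column names: 'question_text' and the binary 'target' label
df = pd.read_csv("train.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["question_text"], df["target"],
    test_size=0.2, random_state=42, stratify=df["target"]
)

# TF-IDF turns each question into a sparse vector of weighted word counts
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Two baseline classifiers trained on the same features
for model in (LogisticRegression(max_iter=1000), MultinomialNB()):
    model.fit(X_train_tfidf, y_train)
    preds = model.predict(X_test_tfidf)
    print(type(model).__name__)
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds, digits=4))
```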

Implementing BERT

In the culmination of our journey, we achieved a significant milestone — the successful implementation of the cutting-edge BERT model.

What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a breakthrough natural language processing model developed by Google. It is designed to understand the context of words in a sentence by utilizing the Transformer architecture, a deep learning model for sequence-to-sequence tasks.

BERT is pre-trained on a massive corpus containing around 3.3 billion words. Fine-tuning BERT on a specific task further improves its accuracy; fine-tuned models often reach 90% or higher on tasks like sentiment analysis and question answering.

There are two different BERT models:

1. BERT Base: 12 layers of Transformer encoder, 12 attention heads, a hidden size of 768, and 110M parameters.

2. BERT Large: 24 layers of Transformer encoder, 16 attention heads, a hidden size of 1024, and 340M parameters.

For each input token, BERT creates three embeddings: token, segment, and position embeddings, which are summed to form the input representation.
Image Source: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
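The post does not specify the exact implementation details, but a minimal fine-tuning sketch using the Hugging Face transformers library (an assumption on my part) could look like the following; the checkpoint, sequence length, learning rate, and example data are purely illustrative:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

class QuoraDataset(Dataset):
    """Wraps question texts and 0/1 sincerity labels for fine-tuning."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=64)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Tiny illustrative data; in the project this would be the Quora training split
train_ds = QuoraDataset(
    ["How do I start learning NLP?", "Why are people from group X so awful?"], [0, 1]
)
loader = DataLoader(train_ds, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(1):  # a real run would use more epochs and far more data
    for batch in loader:
        optimizer.zero_grad()
        outputs = model(**batch)   # returns the loss when 'labels' are supplied
        outputs.loss.backward()
        optimizer.step()
```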

Results:

For the model evaluation process, we relied on the classification report, which summarizes the essential performance metrics:

Classification Report

F1 Score: The harmonic mean of precision and recall, balancing false positives and false negatives into a single score for each class.

Precision: The ratio of correctly predicted positive instances to the total predicted positive instances, highlighting the model’s ability to minimize false positives.

Recall: The ratio of correctly predicted positive instances to the actual positive instances, emphasizing the model’s capability to identify true positives.

Support: The number of occurrences for each class, providing valuable context for the model’s performance.

Accuracy: The percentage of correctly predicted instances out of the total instances evaluated, showcasing the overall effectiveness of the model.
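To connect these definitions, here is a small self-contained example (the labels and predictions are made up purely for illustration) that computes the same quantities a classification report contains:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = insincere)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)          # correctly predicted positives / all predicted positives
recall = tp / (tp + fn)             # correctly predicted positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(precision, recall, f1)        # 0.75 0.75 0.75 for this toy example
print(classification_report(y_true, y_pred, digits=2))
```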

GitHub repository

For those curious to delve into the project, I invite you to explore my GitHub repository where the code and insights await.

References

BERT: https://medium.com/@pallavipannu678/bidaf-vs-bert-observations-f79c79425b6d

K-Fold Cross-Validation: https://www.analyticsvidhya.com/blog/2022/02/k-fold-cross-validation-technique-and-its-essentials/

Interestship 5.0 Experience

I extend my deepest appreciation to my mentor, Pallavi Pannu Ma’am, whose guidance and support have been instrumental in shaping my skills. Her exceptional teaching style made complex concepts simple and enjoyable.

Moreover, I am thankful to Clique Community and IEEE Gujarat Section for organizing this remarkable event, providing a platform for growth and collaboration among talented individuals.

Interestship 5.0 has been a journey of knowledge, camaraderie, and personal growth, and I am excited to continue advancing my passion for NLP and contributing to the world of AI. With immense gratitude, I bid adieu to this wonderful experience and eagerly await future endeavors in the world of technology and innovation.
