Embeddings: BERT better than ChatGPT4?

Avinash Patil
4 min read · Sep 18, 2023



In this study, we compared the effectiveness of semantic textual similarity methods for retrieving similar bug reports based on a similarity score.

We explored several embedding models, including BERT, ADA, Gensim, FastText, and TFIDF. We used the Software Defects Data (https://github.com/av9ash/bugrepo), which contains bug reports from various software projects, to evaluate the performance of these models.

Our experimental results show that BERT generally outperformed the rest of the models in recall, followed by ADA (ChatGPT4), Gensim, TFIDF, and FastText. Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the impact of selecting the appropriate one for this task.

The code implementation of experiments is available at: https://github.com/av9ash/DuplicateBugDetection

ChatGPT4 and BERT are both NLP (Natural Language Processing) models capable of understanding the semantic meaning of text, and have been applied to a variety of tasks including text classification, entity recognition, and more. This article aims to compare their performance in the specific task of identifying duplicate bug reports by evaluating their efficiency in capturing sentence text similarity.

This study used the Defects dataset, which encompasses bug reports from multiple software projects: EclipsePlatform, MozillaCore, Firefox, JDT, and Thunderbird.

Dataset

The dataset comprises approximately 480,000 bug reports, each containing a summary, a description, and metadata attributes, including bug ID, priority, component, status, duplicate flag, resolution, version, created time, and time of resolution. Structured information, in addition to the summary and description, helps improve accuracy [2]. For all experiments in this study, the training data consisted of parent and unique bug reports, and the test data consisted of child (duplicate) bug reports. Table I shows the count of bug reports used to train and test the models.
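As a rough illustration of that split, the sketch below separates parent/unique reports from child (duplicate) reports. The column names (summary, description, dup_id) are assumptions made for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical flat export of one project's bug reports; the real data lives
# in per-project files in the bugrepo repository.
reports = pd.read_csv("bug_reports.csv")

# Combine summary and description into a single text field for embedding.
reports["text"] = reports["summary"].fillna("") + " " + reports["description"].fillna("")

# Child (duplicate) reports point at a parent report via a dup_id-style field.
is_child = reports["dup_id"].notna()
train_df = reports[~is_child]   # parent and unique reports (used to build the index)
test_df = reports[is_child]     # child reports (used as queries at test time)
```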

Embedding Models

The ADA, BERT, FastText, and Doc2Vec experiments required loading pre-trained models for their respective embeddings. Specifically, for BERT, we utilized the “all-mpnet-base-v2” model, optimized for various use cases and trained on a large and diverse dataset comprising over 1 billion training pairs. For ADA (ChatGPT4), we used “text-embedding-ada-002,” OpenAI’s embedding model for text search, text similarity, and code search. For FastText, we employed the “crawl-300d-2M-subword” model, which consists of 2 million word vectors trained with subword information on the Common Crawl dataset, encompassing 600 billion tokens. In the case of Doc2Vec, we used the “GoogleNews-vectors-negative300” model, trained on a portion of the Google News dataset containing approximately 100 billion words. This model provides 300-dimensional vectors for 3 million words and phrases.
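For reference, here is a minimal sketch of how these encoders can be loaded and used to embed a bug report. The model names come from the text above, but the loading code is illustrative and may differ from the repository's implementation.

```python
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import fasttext

text = "App crashes when opening the settings dialog"

# BERT sentence embeddings via Sentence-Transformers ("all-mpnet-base-v2").
bert_model = SentenceTransformer("all-mpnet-base-v2")
bert_vec = bert_model.encode(text)

# ADA embeddings via the OpenAI API (assumes OPENAI_API_KEY is set).
client = OpenAI()
ada_vec = client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding

# FastText subword vectors ("crawl-300d-2M-subword.bin", downloaded separately).
ft_model = fasttext.load_model("crawl-300d-2M-subword.bin")
ft_vec = ft_model.get_sentence_vector(text)
```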

The ADA, BERT, and FastText models were used without fine-tuning, while the Gensim model was fine-tuned specifically on the training PRs of each bug repository, allowing us to leverage the strengths of these pre-trained models in our analysis.
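A hedged sketch of what per-repository Gensim training could look like with Doc2Vec is shown below; the corpus, vector size, and epoch count are illustrative placeholders, not the settings used in the study.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus: (bug_id, report text) pairs from one repository.
train_reports = [
    ("101", "app crashes when opening the settings dialog"),
    ("102", "search results are not sorted by relevance"),
]
corpus = [TaggedDocument(words=text.split(), tags=[bug_id]) for bug_id, text in train_reports]

d2v = Doc2Vec(vector_size=300, min_count=1, epochs=40)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)

# Embed a new (child) report with the trained model.
query_vec = d2v.infer_vector("settings dialog crash on open".split())
```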

Training/Testing

The training mechanism for the Information Retrieval (IR) model remains consistent and straightforward throughout this study. We employed a non-generalizing Nearest Neighbors model, which identifies the specified number of training samples closest in distance to a new point according to a distance metric; smaller distances indicate a higher degree of similarity between the points. We fitted this model with the training data embeddings from each of the considered encoders, ensuring that the model incorporated the encoded information for effective retrieval and matching.
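A minimal sketch of this step with scikit-learn's NearestNeighbors is shown below; the embedding matrix is synthetic, and the distance metric and neighbour count are illustrative choices, not necessarily those used in the repository.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(1000, 768))  # one row per parent/unique report
train_ids = np.arange(1000)                      # bug IDs aligned with those rows

# Fit the non-generalizing nearest-neighbour index on the training embeddings;
# cosine distance is a common choice for sentence embeddings.
index = NearestNeighbors(n_neighbors=10, metric="cosine")
index.fit(train_embeddings)
```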

During testing, we queried the trained model using test data embeddings to obtain the top “n” matches. A query is successful if the known parent report ID is among the returned recommendations.
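Continuing the sketch above, the query step and a simple recall@n check could look like this; the test embeddings and parent IDs are synthetic stand-ins for the child reports and their known parents.

```python
# Synthetic child-report embeddings and their ground-truth parent IDs.
test_embeddings = rng.normal(size=(200, 768))
parent_ids = rng.integers(0, 1000, size=200)

n = 10
_, neighbour_rows = index.kneighbors(test_embeddings, n_neighbors=n)

# A query counts as a hit if its known parent ID appears among the top-n matches.
hits = sum(parent in train_ids[rows] for parent, rows in zip(parent_ids, neighbour_rows))
print(f"recall@{n}: {hits / len(parent_ids):.3f}")
```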

Results

Fig 1 provides insights into the model comparison, revealing a clear performance order: BERT > ADA > Gensim > TFIDF > FastText. Additionally, the accuracy gain is more pronounced at smaller values of n and tends to flatten out as n increases, suggesting that fetching a larger number of potential matches does not necessarily yield a significant increase in accuracy.

Figure 1.

Based on the findings presented in Fig 2, it is evident that BERT consistently outperformed the other models in recall. FastText, on the other hand, exhibited lower accuracy than the baseline TFIDF model. These results provide insights into the comparative performance of the models.

Figure 2.

Conclusion

While both ADA and BERT embeddings show promise in automating the identification of duplicate bug reports, BERT demonstrates superior performance in capturing sentence text similarity based on the recall metric.

The improvements in semantic understanding, efficiency, and generalizability make BERT a compelling choice for teams looking to streamline their bug-tracking workflows.

Please cite if used:

@misc{patil2023comparative,
title={A Comparative Study of {Text Embedding Models} for {Semantic Text Similarity} in {Bug Reports}},
author={Avinash Patil and Kihwan Han and Sabyasachi Mukhopadhyay},
year={2023},
eprint={2308.09193},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
