Evaluating Performance on Different Domains by Fine-Tuning a Pretrained Large Language Model

“An Overview of Evaluating a Pretrained LLM (GPT-2) on Various Domains: A Summary of Methods, Results, and Insights”

Rthvik Raviprakash
Jun 18, 2023

This article provides a concise overview of a project I recently completed with two other collaborators during my Deep Learning course, along with the accompanying research paper. Its primary objective is to summarize our project and highlight the key points discussed in the paper.

Here is the link to the paper: Link

Github link for the code and data: Link

Introduction:

Fine-tuning existing LLMs has proven to be a viable way of producing language models geared toward specific tasks, with minimal resources compared to training them from scratch. We address the question of how cross-domain fine-tuning affects LLM perplexity, and we explore the impact of dataset choice on the performance of fine-tuned models, using the perplexity score as the metric. By fine-tuning GPT-2 and then evaluating it on a set of varied text domains, we aim to determine the optimal dataset for fine-tuning and investigate whether within-domain transfer learning is more effective in the field of natural language processing (NLP).

Motivation:

Large language models (LLMs) are widely used and continuously improved by research institutions and companies, and they are trained on enormous numbers of examples to achieve fluency in a given language. To train these models, researchers often rely on vast amounts of data from sources like Wikipedia, novels, web crawl corpora, and social media. However, the emphasis tends to fall on data quantity rather than quality.

Smaller organizations or companies with limited resources can utilize pre-trained models as a starting point and then fine-tune them for their specific tasks. This approach reduces the resources required for training. However, fine-tuning a pre-trained model can still be costly, so it is crucial to consider the type, source, and quality of the training data. The choice of data for fine-tuning is essential for achieving the best results.

One approach to fine-tuning is cross-domain fine-tuning, which involves using text from a different domain than the target domain. By using cross-domain methods, less costly training data is needed to develop new machine learning tasks [1]. For instance, an LLM intended for auto-completion of academic papers could be initially fine-tuned using text from newspapers. Cross-domain fine-tuning offers two main advantages. Firstly, it is possible that LLMs perform better in their target domain when initially trained with a different domain. Secondly, incorporating text from another domain increases the overall amount of training data available. This area of study has the potential to enhance the efficiency and results of LLM fine-tuning.

In our paper, we investigate the impact of fine-tuning and evaluating a generative language model (GPT-2) across various text domains. To ensure consistency, we use a fixed design for the language model, keeping the hyperparameters constant throughout the experiments. We first establish baseline measurements using the pre-trained model without any fine-tuning. We then fine-tune the model individually on three distinct text domains and assess the performance of each fine-tuned model across all three domains, using perplexity as the metric.

Data:

The datasets used in this paper were retrieved from Kaggle. We picked three distinct domains for fine-tuning the model: Philosophy, Poetry, and BBC News, selected specifically to provide a comprehensive evaluation framework. Philosophy and poetry were chosen for their distinctive characteristics: they are domains where the model is unlikely to perform well without fine-tuning, which lets us thoroughly assess the impact of fine-tuning on the model's performance. The news dataset, by contrast, offers a different perspective, since similar material was likely abundant during the model's pretraining.

You can find the data on the Github link to this project.

Methods:

The baseline model:

The baseline method uses the pre-trained GPT-2 model because of its strengths in open-ended text generation and contextual coherence. Compared to models like BERT, GPT-2 is better suited for this experiment, as it focuses on predicting the next word given the preceding context. The effectiveness of fine-tuning is evaluated using perplexity scores, as mentioned previously, which measure the model's ability to predict the next token in a sequence. The goal is to determine whether fine-tuning GPT-2 on specific domains improves its perplexity scores, indicating a better grasp of domain-specific language.
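As a concrete illustration, here is a minimal sketch of how such a baseline can be set up with the Hugging Face transformers library. The checkpoint name ("gpt2") and the example sentence are illustrative assumptions, not the exact code from our repository.

```python
# Minimal baseline sketch: load the pretrained GPT-2 and measure the
# perplexity of a piece of text (no fine-tuning involved).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"  # smallest pretrained GPT-2 checkpoint (assumption)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Exponentiate the average next-token cross-entropy loss over `text`."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean
        # cross-entropy loss of predicting each next token.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("The unexamined life is not worth living."))
```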

Preprocessing the data:

For the experiments, three distinct text domains are chosen: historical philosophical writings, poetry, and BBC news articles/summaries. Publicly available text datasets from Kaggle are collected for each domain, ensuring that they are large and cover a wide range of topics. To prepare the data for fine-tuning, pre-processing steps are applied: removing special characters, converting text to lowercase, and tokenizing. Each dataset is also condensed to a standard size (a fixed portion of the combined data) to maintain consistency and to mitigate the impact of training and test set size on perplexity.
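A rough sketch of these pre-processing steps is shown below. The file name (philosophy.txt) and the 50,000-token cap are hypothetical placeholders, not the actual standard size used in the paper.

```python
# Sketch of the pre-processing pipeline: lowercase, strip special
# characters, tokenize with the GPT-2 tokenizer, and trim every domain
# to the same number of tokens.
import re
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def clean(text: str) -> str:
    text = text.lower()
    # keep letters, digits, whitespace, and basic punctuation
    return re.sub(r"[^a-z0-9\s.,;:!?'-]", " ", text)

def prepare(path: str, max_tokens: int = 50_000) -> list:
    """Read a raw text file and return a fixed-size list of token ids."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    token_ids = tokenizer(clean(raw))["input_ids"]
    return token_ids[:max_tokens]  # condense every domain to the same size

philosophy_ids = prepare("philosophy.txt")  # hypothetical file name
```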

Finetuning:

The fine-tuning process is conducted separately for the three chosen domains: historical philosophical writings, poetry, and BBC news articles/summaries. Each domain’s specific training set is used to fine-tune the pre-trained GPT-2 model. To maintain consistency and isolate the domain’s influence on performance, the hyperparameters, such as learning rate, batch size, training epochs, and model architecture size, remain constant across all models. Refer to Figure 1 in the paper for the pseudocode detailing the training process.
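The sketch below gives a rough, runnable counterpart to that pseudocode using the Hugging Face Trainer. The hyperparameter values shown (three epochs, batch size 4, learning rate 5e-5) are illustrative stand-ins for the fixed settings used in the experiments, not the paper's actual values.

```python
# Per-domain fine-tuning sketch with the Hugging Face Trainer.
# The same function is called once per domain so that every model is
# trained with identical (placeholder) hyperparameters.
from datasets import Dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def finetune(train_texts, output_dir):
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    ds = Dataset.from_dict({"text": train_texts}).map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # placeholder, constant across domains
        per_device_train_batch_size=4,   # placeholder, constant across domains
        learning_rate=5e-5,              # placeholder, constant across domains
        save_strategy="no",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    return model

# philosophy_model = finetune(philosophy_texts, "gpt2-philosophy")
```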

Evaluation:

Perplexity is utilized as the objective evaluation metric to measure the performance of the fine-tuned models. Perplexity scores indicate how well a language model predicts the next token in a sequence. It is calculated by exponentiating the cross-entropy loss [2], which measures the difference between the model’s prediction and the true value.

PP(W) = e^(H(W))

Here PP(W) is the perplexity score and H(W) is the cross-entropy loss. Please refer to [2] for a detailed explanation of the formula and how perplexity is calculated.

Lower perplexity values are associated with more accurate language models. By using perplexity as the evaluation metric, the performance of the fine-tuned models across different text domains can be directly compared. Perplexity scores provide insights into the impact of fine-tuning on the models’ ability to capture domain-specific language patterns.
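To make the comparison concrete, here is a small sketch of the cross-domain evaluation grid, assuming each model (the baseline plus the three fine-tuned variants) is scored on each domain's held-out test texts. The dictionary and variable names are hypothetical.

```python
# Cross-domain evaluation sketch: score every model on every domain's
# test set and report the exponentiated mean cross-entropy, PP(W) = e^(H(W)).
import math
import torch

def domain_perplexity(model, tokenizer, texts):
    model.eval()
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        with torch.no_grad():
            losses.append(model(**enc, labels=enc["input_ids"]).loss.item())
    return math.exp(sum(losses) / len(losses))

# Hypothetical names for the trained models and held-out test splits:
# models = {"baseline": baseline_model, "philosophy": philosophy_model, ...}
# domains = {"philosophy": phil_test, "poetry": poetry_test, "news": news_test}
# for m_name, model in models.items():
#     for d_name, texts in domains.items():
#         print(m_name, d_name, domain_perplexity(model, tokenizer, texts))
```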

Results and conclusion:

The models include the pre-trained GPT-2 as a baseline and fine-tuned models in the domains of Philosophy, Poetry, and BBC News. The baseline GPT-2 model had higher perplexity scores across all domains, indicating lower fluency. It performed poorly on philosophy texts but better on news reports, which was expected given the likely presence of news-like text in the pre-training data. We also used a fourth, general domain to evaluate how each fine-tuned variant compares to the baseline; the general domain consists of randomly chosen examples in equal quantities from the philosophy, poetry, and BBC news datasets.

Below is the graph for the results of the perplexity values scaled logarithmically across the fine-tuned domains.

Perplexity scores across domains, fine-tuning, and hyperparameters.

As you can see from the graph, fine-tuning the model on philosophy led to a decrease in perplexity for philosophy texts, with some improvement seen in poetry but not in news or general text. Fine-tuning on poetry decreased perplexity in poetry and also improved results on the general dataset, with no significant changes observed in the other domains. When fine-tuned on news, the model achieved the lowest perplexity scores in the news and general domains compared to the previous models.

The hyperparameters were not finely tuned in our case, leaving room for future work and exploration. The results suggest a connection between the poetry and philosophy datasets and another connection between the news and general datasets. Fine-tuning on philosophy disproportionately improved perplexity scores in poetry, while fine-tuning on news reports resulted in the largest improvement for the news and general domains.

The findings indicate that intentional domain selection for cross-domain fine-tuning can enhance LLM performance. This knowledge can influence how organizations gather training data, and drawing on data from other domains can increase the amount of training data available. Further research is necessary to understand why cross-domain fine-tuning succeeds and how different domains affect fine-tuning.

That’s it from me and feel free to drop any suggestions or questions regarding the article or the paper and I am happy to talk about it!

References:

[1] Stefan van Berkum, Sophia van Megen, Max Savelkoul, Pim Weterman, and Flavius Frasincar. 2022. Fine-Tuning for Cross-Domain Aspect-Based Sentiment Classification. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT ’21). Association for Computing Machinery, New York, NY, USA, 524–531. https://doi.org/10.1145/3486622.3494003

[2] Campagnola, C. (2022, November 5). Perplexity in language models. Medium. https://towardsdatascience.com/perplexity-in-language-models-87a196019a94
