Fine-tuning to enhance NLP tasks with small, self-hosted language models

Instruction fine-tuning Mistral:Instruct 7B improved standard named entity recognition (NER) F1 score by 12% and domain-specific NER by 30%.

Ana Areias · Published in Kineviz · 7 min read · Mar 14, 2024

Intro

In Kineviz SightXR, we can connect to either third-party proprietary models or open-source, privately hosted language models to extract user-defined entities and relationships from unstructured data (such as the text in PDFs). These are then presented directly in the platform as a visual knowledge map that can be explored and analyzed further.

An important consideration is whether the information we’re interested in is likely to be reliably extracted. How well does our model perform out of the box? What methods can we use to increase the reliability of results?

In this article, we describe our experience fine-tuning Mistral:Instruct 7B, a Small Language Model (SLM), for two Named Entity Recognition (NER) tasks: standard NER and medical-domain NER.

An SLM is defined as a Large Language Model (LLM) with only a few billion parameters. While typically less powerful than LLMs, SLMs offer unique advantages. They are smaller and require fewer computing resources, which means they can be fine-tuned faster. They can be hosted locally, addressing concerns of privacy. Self-hosting also means that your application is not susceptible to unexpected updates in a third-party API.

Here we compare our model’s zero-shot performance on the two NLP tasks with its fine-tuned performance. Zero-shot means the model has had no specific training on the entities of interest. Fine-tuning, on the other hand, involves showing the model labeled examples from which it can learn. We find that fine-tuning can greatly improve the performance of an SLM.

Instruction tuning

Instruction tuning is a form of supervised fine-tuning where a pre-trained Large Language Model (LLM) undergoes further training on a smaller task-specific dataset. This dataset is enriched with high-level instructions aimed at directing the model’s behavior. Through this process, the model adjusts its parameters to suit the new data or task, drawing upon the knowledge acquired during its initial training.

We have two different labeled datasets, one for each task. They comprise input-output pairs of sentences and the corresponding ground-truth entities that should be extracted from the sentence. For each dataset, we randomly sample 1000 observations for the training set and 100 for the testing set.

The first task is what we’re calling “standard” Named Entity Recognition (NER): given a sentence, identify entities of type Person, Organization, Location, and a catch-all Miscellaneous category. For the standard NER task we use the dataset introduced at CoNLL-2003. The count of entities in the training dataset shows that Miscellaneous appears roughly half as often as the other three categories.
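As a rough sketch, sampling the training and test splits could be done with the Hugging Face datasets library as below. The dataset id and the random seed are illustrative assumptions, not necessarily the exact values we used.

```python
# Minimal sketch: load CoNLL-2003 and sample 1000 training / 100 test observations.
from datasets import load_dataset

conll = load_dataset("conll2003")

train_sample = conll["train"].shuffle(seed=42).select(range(1000))
test_sample = conll["test"].shuffle(seed=42).select(range(100))

print(train_sample[0]["tokens"])    # the sentence, tokenized
print(train_sample[0]["ner_tags"])  # integer BIO tags for PER, ORG, LOC, MISC
```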

Example entry of standard NER dataset

Count of standard NER entities in 1000 training observations

We augment the input-output pairs of sentences and entities with an instruction prompt describing the NLP task to the LLM. The zero-shot example uses only the instruction prompt and the target sentence. The fine-tuning example uses the instruction prompt, target sentence, and expected output entities.

In our case, the instruction prompt and target sentence are contained between [INST] and [/INST] tags, and the instruction prompt, target sentence, and expected output are wrapped in the beginning- and end-of-sequence tokens <s> and </s>. This is a format specific to the model that we are using. Fine-tuning will be more successful if our data matches the format of the model’s pre-training data, which varies model by model.
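A minimal formatting sketch is shown below. The helper name, the exact instruction wording, and the JSON output format are illustrative assumptions; only the [INST]/[/INST] and <s>/</s> tag placement follows the Mistral instruction format described above.

```python
from typing import Optional

def format_example(sentence: str, entities_json: Optional[str] = None) -> str:
    # Illustrative prompt wording -- a simplification, not our exact instruction.
    instruction = (
        "Extract all Person, Organization, Location and Miscellaneous "
        "entities from the sentence below and return them as JSON."
    )
    prompt = f"<s>[INST] {instruction}\n\nSentence: {sentence} [/INST]"
    if entities_json is None:
        # Zero-shot / inference: the model must generate the entities itself.
        return prompt
    # Fine-tuning: append the expected output so the model can learn from it.
    return f"{prompt} {entities_json}</s>"

# Zero-shot prompt
print(format_example("Barack Obama visited Paris."))

# Training example augmented with ground-truth entities
print(format_example(
    "Barack Obama visited Paris.",
    '{"Person": ["Barack Obama"], "Location": ["Paris"]}',
))
```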

Example training dataset entry of standard NER dataset augmented with prompt for instruction fine-tuning

The second task is domain-specific NER, which entails identifying biological entities of type DNA, RNA, Cell Type, Cell Line, and Protein in a text. For the domain-specific NER case we use the BioNLP 2004 NER dataset. The count of entities shows that the dataset is quite imbalanced: RNA instances are few and far between, at just 48 cases in 1,000 observations, while Protein examples abound.

Count of domain-specific entities in 1000 training observations

Example entry of domain-specific NER dataset augmented with prompt for instruction fine-tuning

We consider a predicted entity as correct only if it is an exact match of a ground-truth entity. We measure the performance of the NER task in terms of precision, recall, and F1 score (see the scoring sketch after the definitions below), where:

  • Precision indicates the proportion of correctly predicted positive instances out of all instances predicted as positive.
  • Recall, also known as sensitivity, measures the proportion of actual positive instances the model correctly identified as positive; it captures the completeness of the model’s predictions.
  • The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall, making it useful for evaluating the overall performance of a model. The F1 score ranges from 0 to 1, with a value of 1 indicating perfect precision and recall. It is particularly valuable when there is an imbalance between the classes in the dataset.
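The sketch below shows one way to compute these scores under exact matching, assuming predictions and ground truth are represented as (entity text, entity type) pairs. The helper name and data representation are ours, for illustration only.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (entity text, entity type)

def ner_scores(predicted: List[Entity], gold: List[Entity]) -> dict:
    # A prediction counts only if both the text and the type match exactly.
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: one spurious prediction and one missed entity
print(ner_scores(
    predicted=[("Barack Obama", "Person"), ("Paris", "Organization")],
    gold=[("Barack Obama", "Person"), ("Paris", "Location")],
))  # precision 0.5, recall 0.5, f1 0.5
```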

As our base model, we use the pre-trained open-source model Mistral-7B-Instruct-v0.2 from Mistral AI, downloaded from the Hugging Face model hub. Other possible open-source models include Meta’s Llama 2, Microsoft’s Phi-2, and most recently Google’s Gemma.

To streamline the fine-tuning process, we adopt QLoRA, a variant of Parameter-Efficient Fine-Tuning (PEFT): we first quantize the model to 4 bits and then train Low-Rank Adaptation (LoRA) adapters on top. Given that full-parameter fine-tuning demands substantial memory and high-performance GPUs, QLoRA effectively reduces memory usage during LLM fine-tuning with negligible performance tradeoffs.
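Loading the base model in 4-bit precision with the bitsandbytes integration in transformers might look roughly like the following. The specific quantization options (NF4, double quantization, float16 compute) are common QLoRA defaults and an assumption on our part.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit quantization configuration (assumed defaults; float16 suits a T4 GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```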

Numerous parameters play a crucial role in the effectiveness of QLoRA. In our experiments, we used a LoRA rank of 32 and an alpha value of 16. We trained for 500 steps with a batch size of 2, equivalent to one epoch given our training dataset of 1,000 observations. We used a learning rate of 2.5e-5 and a dropout of 0.05, and we added trainable weights to all the linear layers of the model. With these settings, the trainable parameters amount to less than 2% of the total model parameters.
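A sketch of how these settings could be wired up with the peft library and the Hugging Face Trainer follows. The explicit target_modules list simply spells out “all linear layers” for the Mistral architecture, and the tokenized training dataset is a placeholder assumed to have been built from the formatted prompts above; the remaining arguments are illustrative.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Prepare the quantized model for training and attach LoRA adapters
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # well under 2% of total parameters

training_args = TrainingArguments(
    output_dir="mistral-ner-qlora",
    max_steps=500,
    per_device_train_batch_size=2,
    learning_rate=2.5e-5,
    logging_steps=50,
    fp16=True,  # the T4 we used does not support bf16
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # placeholder: prompts tokenized with the tokenizer
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```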

Because we were primarily focused on dialing in the fine-tuning process, we didn’t conduct a thorough search for hyperparameters that would guarantee optimal performance on the NER task. When doing numerous runs across different parameter sets, it helps to use an experiment tracking platform like Weights & Biases.

Infrastructure

Our experiments were run using the Hugging Face Transformers library on a Colab Pro instance with an NVIDIA T4 GPU (15 GB of VRAM).

Results

Using a small training data set of only 1000 observations, we significantly improved NER performance as measured by the F1 score on 100 test observations.

In the standard NER case, overall F1 increased by 9 percentage points (pp), from 0.76 to 0.85, an 11.5 percent improvement. The biggest improvement was in Organization, the category the model initially found most difficult, where F1 increased by 13 pp, an almost 20 percent increase. The smallest change was in the Miscellaneous category, at only 3 pp, a modest 4 percent increase over the baseline.

Test set F1 statistics for standard NER

The gains on the medical domain dataset were even more pronounced, with a 30 percent increase in overall F1 score, equivalent to a 17 pp increase in F1 from 0.57 to 0.75. For the two categories the base model found most difficult, the increase in F1 was a whopping 48 percent! In the categories where the model was already performing best, the gains were much less pronounced, at 2 and 3 pp for the DNA and RNA categories, respectively.

Test set F1 statistics for domain-specific NER

This simple experiment showed that fine-tuning can be an effective way to increase a self-hosted model’s ability to identify entities of interest, especially those that are domain-specific and stray from the standard entity extraction case. Even without an extensive search for QLoRA hyperparameters, we achieved significant gains.

The next steps in our fine-tuning journey are to scale up to the full datasets, do more extensive hyperparameter tuning, and explore using OpenAI’s GPT-4 to generate synthetic training examples for data-scarce scenarios. Stay tuned! (Pun intended.)

Appendix

Precision and recall results for standard NER

Precision and recall results for medical domain NER
