Fine-Tuning LLAMA 3 Model for Relation Extraction Using UBIAI Data

Walid Amamou
UBIAI NLP
Aug 13, 2024

Extracting meaningful relationships from unstructured text has gained critical importance. This article examines the process of fine-tuning Large Language Models (LLMs), particularly LLAMA 3, for Relation Extraction tasks. We will discuss how to utilize data annotated with UBIAI, a state-of-the-art annotation platform, to boost the model’s efficiency in identifying and classifying semantic relationships within text.

What is Relation Extraction?

Relation Extraction is a fundamental task in NLP, acting as a crucial link between unstructured text and structured knowledge. This process entails identifying entities within a text and determining the semantic relationships that connect them. For example, in the sentence “Tesla, founded by Elon Musk, is revolutionizing the electric vehicle industry,” a relation extraction system would recognize the entities “Tesla,” “Elon Musk,” and “electric vehicle industry,” and extract relationships such as “founded by” and “revolutionizing.”
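To make this concrete, here is a minimal sketch in plain Python of the structured output such a system aims to produce for that sentence (the entity types are illustrative, not fixed by the example):

# Illustrative only: the structured form a relation extraction system
# might return for the Tesla sentence above.
entities = [
    ("Tesla", "ORGANIZATION"),
    ("Elon Musk", "PERSON"),
    ("electric vehicle industry", "INDUSTRY"),
]

relations = [
    ("Tesla", "founded by", "Elon Musk"),
    ("Tesla", "revolutionizing", "electric vehicle industry"),
]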

Large Language Models (LLMs)

The advent of Large Language Models has ushered in a new era of NLP capabilities. These sophisticated models, trained on vast corpora of text, possess an intrinsic understanding of language structures and semantics. LLAMA 3, a state-of-the-art LLM, exemplifies this power with its ability to comprehend and generate human-like text across diverse domains. By fine-tuning LLAMA 3 for relation extraction, we can leverage its deep language understanding to extract nuanced relationships from text with unprecedented accuracy.

Data Preparation for Fine-Tuning

Preparing Data: The UBIAI Advantage

At the heart of any successful machine learning project lies high-quality data. This is where UBIAI shines as a game-changing annotation platform. UBIAI goes beyond traditional annotation tools by offering:

  1. Advanced Document Processing: UBIAI excels in extracting text from various document formats, maintaining structural integrity and contextual information crucial for relation extraction tasks.
  2. Intuitive Annotation Interface: The platform provides a user-friendly environment for annotators to effortlessly identify entities and define relationships, ensuring consistent and accurate labeling.
  3. Quality Control Mechanisms: UBIAI incorporates built-in validation tools and inter-annotator agreement features, significantly enhancing the reliability of the annotated dataset.
  4. Customizable Annotation Schemas: Users can define custom entity types and relationship categories, tailoring the annotation process to specific domains or use cases.
  5. Collaborative Workflows: UBIAI supports team-based annotation projects, allowing for efficient distribution of tasks and seamless collaboration among annotators.

By leveraging UBIAI’s powerful features, researchers and data scientists can create high-fidelity datasets specifically designed for training relation extraction models. This meticulously annotated data serves as the foundation for fine-tuning LLAMA 3, enabling it to excel in extracting complex relationships from text across various domains.

Data Preprocessing

After exporting the annotated data from UBIAI in JSON format, it needs to be preprocessed to match the format required for fine-tuning the LLAMA 3 model. Here’s a Python script that demonstrates this process:

import json
import pandas as pd

def preprocess_json(data, possible_relationships):
    # Extract the relevant information
    document = data['document']
    tokens = data['tokens']
    relations = data['relations']

    # Create a mapping of token index to its text and entity label
    token_info = {i: {'text': t['text'], 'label': t['entityLabel']} for i, t in enumerate(tokens)}

    # Format the entities and relationships
    entities = [(t['text'], t['entityLabel']) for t in tokens]
    formatted_entities = ", ".join([f"{text} ({label})" for text, label in entities])

    formatted_relations = []
    for r in relations:
        child_index = r['child']
        head_index = r['head']

        if child_index < len(tokens) and head_index < len(tokens):
            child = token_info[child_index]['text']
            head = token_info[head_index]['text']

            relation_label = r['relationLabel']
            formatted_relations.append(f"{child} -> {head} ({relation_label})")

    formatted_relations = "; ".join(formatted_relations)

    # Create the formatted prompt and response. The role markers below use
    # ChatML-style tags, matching the chat format that trl's setup_chat_format
    # applies during fine-tuning later in this tutorial.
    prompt = (
        "<|im_start|>system\nExtract relationships between entities from the following text.<|im_end|>\n"
        f"<|im_start|>user\nText: \"{document}\" Entities: {formatted_entities}. "
        f"Possible relationships: {', '.join(possible_relationships)}.<|im_end|>\n"
    )
    response = f"<|im_start|>assistant\nThe relations between the entities: {formatted_relations}<|im_end|>\n"
    full_prompt = prompt + response
    return full_prompt

# List of possible relationships (customize as needed)
possible_relationships = ["MUST_HAVE", "REQUIRES", "NICE_TO_HAVE"]

input_path = "/content/UBIAI_REL_data.json"
# Read the input JSON file
with open(input_path, 'r') as file:
    data = json.load(file)

# Preprocess all annotated records
data = [preprocess_json(j, possible_relationships) for j in data]

# Convert to a DataFrame
df = pd.DataFrame(data, columns=["text"])

# Save to CSV
df.to_csv('fine_tuning_data.csv', index=False)

The script converts each annotated record into a single prompt-response string suitable for fine-tuning LLAMA 3 and writes the results to fine_tuning_data.csv.
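For reference, a single exported record as consumed by preprocess_json has roughly the following shape. The field names come from the script above; the document, labels, and indices here are purely illustrative, and the real UBIAI export may contain additional fields:

# Illustrative example of one UBIAI export record as read by preprocess_json.
# Field names match the script above; values are made up for demonstration.
example_record = {
    "document": "1+ years development experience on Java stack.",
    "tokens": [
        {"text": "1+ years", "entityLabel": "EXPERIENCE"},
        {"text": "development", "entityLabel": "SKILLS"},
    ],
    "relations": [
        {"child": 0, "head": 1, "relationLabel": "EXPERIENCE_IN"},
    ],
}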

Fine-Tuning LLAMA 3

To fine-tune the LLAMA 3 model for relation extraction, we use the Hugging Face ecosystem in Python: transformers for the model and tokenizer, peft for LoRA, and trl for supervised fine-tuning. Here’s a basic setup to get you started:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig
from trl import setup_chat_format

model_id = "meta-llama/Meta-Llama-3-8B"

# Tokenizer setup
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True, trust_remote_code=True)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
tokenizer.model_max_length = 2048

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Model setup
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    quantization_config=bnb_config
)

model, tokenizer = setup_chat_format(model, tokenizer)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    target_modules=["q_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

This code sets up the LLAMA 3 model with 4-bit quantization and LoRA (Low-Rank Adaptation) for efficient fine-tuning.
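As an optional sanity check (not part of the original pipeline), you can confirm that 4-bit quantization has brought the 8B model down to a size that fits on a single GPU:

# Optional check: report the quantized model's memory footprint.
# With 4-bit NF4 weights, Meta-Llama-3-8B typically lands around 5-6 GB.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")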

Training Configuration

Next, we set up the training arguments:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft_model_path",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="adamw_8bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    report_to="tensorboard",
)

These arguments define various aspects of the training process, such as the number of epochs, batch size, learning rate, and optimization strategy.
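One step the snippets above leave implicit is turning the preprocessed CSV into a dataset object for the trainer. A minimal sketch using the Hugging Face datasets library, assuming the fine_tuning_data.csv file produced earlier:

from datasets import load_dataset

# Load the prompt-response strings produced during preprocessing.
# The "text" column is what the trainer reads (dataset_text_field="text" below).
dataset = load_dataset("csv", data_files="fine_tuning_data.csv", split="train")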

Training the Model

Now we can set up the trainer and start the fine-tuning process:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=512,
    tokenizer=tokenizer,
    dataset_text_field="text",
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)
trainer.train()
trainer.save_model()

This code initializes the SFTTrainer (Supervised Fine-Tuning Trainer) and begins the training process. After fine-tuning, the next step merges the newly learned LoRA adapter weights back into the base model.

from peft import PeftModel

base_model = "meta-llama/Meta-Llama-3-8B"
new_model = "/content/REL_finetuned_llm"

base_model_reload = AutoModelForCausalLM.from_pretrained(
    base_model,
    return_dict=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

base_model_reload, tokenizer = setup_chat_format(base_model_reload, tokenizer)

model = PeftModel.from_pretrained(base_model_reload, new_model)
model = model.merge_and_unload()

model.save_pretrained("llama-3-8b-REL")
tokenizer.save_pretrained("llama-3-8b-REL")

This code loads the base model, applies the fine-tuned weights, and saves the merged model.
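The merged checkpoint is self-contained, so in a later session it can be reloaded like any other Transformers model. A brief sketch, assuming the llama-3-8b-REL directory saved above:

# Reload the merged, fine-tuned model and its tokenizer from disk.
model = AutoModelForCausalLM.from_pretrained(
    "llama-3-8b-REL",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("llama-3-8b-REL")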

Inference

Finally, we can use the fine-tuned model for inference:

messages = [{"role": "user", "content": """Extract relationships between entities from the following text. Text: "1+ years development experience on Java stack AppConnect / API's experience is added advantage. Compute, Network and Storage Monitoring Tools (Ex: Netcool) Application Performance Tools (IBM APM) Cloud operations and Automation Tools (VmWare, ICAM, ...) Proven Record of developing enterprise class products and applications. Preferred Tech and Prof Experience None EO Statement IBM is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status. ." Entities: 1+ years (EXPERIENCE), development (SKILLS). Possible relationships: EXPERIENCE_IN, LOCATED_IN, WORKS_FOR, PART_OF, CREATED_BY."""}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
torch_dtype=torch.float16,
device_map="auto",
)

outputs = pipe(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

The fine-tuned model produces the following output:

Output: 1+ years (Experience) -> development (Skills) (Experience_In)

This tutorial has illustrated how to utilize a fine-tuned model to extract relationships from text.
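In practice you will usually want this generated text back as structured data rather than a formatted string. The helper below is a small illustrative sketch (the parse_relations function and its regex are not part of the original pipeline); it targets the child (LABEL) -> head (LABEL) (RELATION) format shown in the output above:

import re

def parse_relations(generated_text):
    """Parse 'child (LABEL) -> head (LABEL) (RELATION)' fragments into dicts."""
    pattern = r"(.+?)\s*\((\w+)\)\s*->\s*(.+?)\s*\((\w+)\)\s*\((\w+)\)"
    triples = []
    for match in re.finditer(pattern, generated_text):
        child, child_label, head, head_label, relation = match.groups()
        triples.append({
            "child": child.strip(),
            "child_label": child_label,
            "head": head.strip(),
            "head_label": head_label,
            "relation": relation,
        })
    return triples

# Example with the output shown above:
print(parse_relations("1+ years (Experience) -> development (Skills) (Experience_In)"))
# [{'child': '1+ years', 'child_label': 'Experience', 'head': 'development',
#   'head_label': 'Skills', 'relation': 'Experience_In'}]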

Conclusion

Fine-tuning LLAMA 3 for relation extraction involves several key steps, including data annotation, preprocessing, model setup, fine-tuning, and inference. By following this guide, you can harness LLAMA 3’s capabilities to extract meaningful relationships from your text data. LLAMA 3’s flexibility and robustness make it well-suited for various NLP tasks, including relation extraction.

This tutorial has showcased the end-to-end process using data annotated in UBIAI. It demonstrates how advanced language models can transform raw text into structured information. The combination of UBIAI’s annotation capabilities and LLAMA 3’s powerful language understanding offers a potent tool for extracting valuable insights from unstructured text data.

Walid Amamou
Founder of UBIAI, annotation tool for NLP applications | PhD in Physics.