Fine-Tuning a Language Model to Emulate Researcher Writing Style

Abdulrehman
May 5, 2024

This article walks through fine-tuning a language model to emulate the writing style of a researcher. We cover collecting the researcher's papers, extracting and cleaning their text, generating neutral-style counterparts with an LLM, and fine-tuning a small model with LLaMA Factory so that it picks up the researcher's technical terminology and academic tone.

Data Preparation

During data preparation, we collected scholarly research papers covering a range of topics, methodologies, and writing conventions. We then extracted and cleaned the text of each paper so that the resulting dataset is accurate and relevant.

Research Papers

We start from the researcher's Google Scholar profile and download every paper that is available as a PDF. These PDFs are the raw material for the text extraction step.

Text Extraction

We use the PyPDF2 library to extract the text of each paper, going through the document one page at a time.

import os
import re

import PyPDF2

def process_pdf(pdf_path):
    # Open the PDF file in binary mode
    with open(pdf_path, 'rb') as file:
        # Create the PDF reader object
        pdf_reader = PyPDF2.PdfReader(file)

        # Extract text from each page and concatenate it
        text = ''
        for page in pdf_reader.pages:
            text += page.extract_text()

    return text

Preprocessing

Preprocessing is the most important step. The functions below strip email addresses, selected punctuation marks, the figure label "fig", lines that mention a university (typically author affiliations), and everything from the reference section onward.

def remove_emails(text):
    # Regex pattern to match email addresses
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.sub(email_pattern, '', text)

def remove_university_references(text):
    # Remove entire lines containing the word "university"
    lines = text.split('\n')
    cleaned_lines = [line for line in lines if 'university' not in line.lower()]
    return '\n'.join(cleaned_lines)

def remove_junk_text(text):
    # Remove specific punctuation marks
    text = re.sub(r'[,;:?!]', '', text)
    # Remove the standalone word "fig" (figure labels), ignoring case
    return re.sub(r'\bfig\b', '', text, flags=re.IGNORECASE)

def remove_reference_section(text):
    # Find the first occurrence of "REFERENCE" (case-sensitive)
    reference_index = text.find('REFERENCE')
    if reference_index != -1:
        # Keep only the text that comes before the reference section
        return text[:reference_index]
    return text

def preprocess_text(text):
    processed_text = remove_emails(text)
    processed_text = remove_junk_text(processed_text)
    processed_text = remove_university_references(processed_text)
    processed_text = remove_reference_section(processed_text)
    return processed_text
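
The chunking step that follows reads a file named extracted_text.json, but the article does not show how that file is written. Below is a minimal sketch of one way to tie extraction and preprocessing together, assuming the file holds a plain JSON list with one cleaned string per paper. The helper name build_corpus and the folder name in the usage note are hypothetical, and the two processing functions are passed in as arguments so the sketch stays self-contained.

```python
import json
from pathlib import Path
from typing import Callable, List


def build_corpus(
    pdf_dir: str,
    out_path: str,
    extract: Callable[[str], str],
    preprocess: Callable[[str], str],
) -> List[str]:
    """Extract and clean every PDF in pdf_dir, then save the texts as a JSON list."""
    texts = []
    # Sort the paths so the output order is reproducible across runs
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        cleaned = preprocess(extract(str(pdf_path)))
        if cleaned.strip():  # skip papers that yielded no usable text
            texts.append(cleaned)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(texts, f, indent=4)
    return texts
```

With the functions defined earlier, a call could look like build_corpus("papers", "extracted_text.json", process_pdf, preprocess_text), where "papers" stands for whatever folder holds the downloaded PDFs.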

Making Chunks of the Dataset

import json

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of at most 1050 characters, with up to 70 characters of overlap
    chunk_size=1050,
    chunk_overlap=70,
    length_function=len,
    is_separator_regex=False,
)

# Load the list of extracted texts from the JSON file
with open('extracted_text.json', 'r') as f:
    data = json.load(f)

# Flat list that collects the chunks from every text item
all_chunks = []

# Split each text into chunks and add them to the list
for text_item in data:
    all_chunks.extend(text_splitter.split_text(text_item))

# Save the chunks to a new JSON file
with open('output_chunks.json', 'w') as f:
    json.dump(all_chunks, f, indent=4)

print("Chunks have been saved to 'output_chunks.json'.")
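
To make the roles of chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch that cuts fixed-width character windows. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name sliding_chunks is hypothetical and only illustrates how consecutive chunks share text.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same idea as the 70-character overlap above: a sentence cut at a chunk boundary still appears with some of its context in the next chunk.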

Converting Text into Neutral Text Using an LLM

import asyncio
import json

from langchain.prompts import PromptTemplate
from langchain_together import Together

# Prompt that asks the LLM to rewrite a chunk in a neutral style
prompt = PromptTemplate(
    template="""Convert the given text into neutral text.
Given text:\n{text}""",
    input_variables=["text"],
)

llm = Together(
    model="codellama/CodeLlama-70b-Instruct-hf",
    temperature=0.7,
    top_k=1,
    together_api_key="API_Key",  # replace with your own Together API key
    max_tokens=500,
)

with open('./output_chunks.json', 'r') as f:
    data = json.load(f)

# The chain fills in the prompt itself, so it receives the raw chunks as inputs
chain = prompt | llm
inputs = [{"text": chunk} for chunk in data]

print(len(data))

count = 0
neutral_texts = []
while count < len(inputs):
    # Send up to 100 chunks per batch
    ans = asyncio.run(chain.abatch(inputs[count:count + 100]))
    for a in ans:
        neutral_texts.append({"Raw_text": data[count], "Neutral_text": a})
        count += 1
    print("Processed ", count, " texts")
    # Save progress after every batch
    with open('neutral_texts.json', 'w') as f:
        json.dump(neutral_texts, f, indent=4)

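One practical constraint: for now, the training setup only accepts data in JSON, CSV, or TXT format. We could feed the raw paper text to the model directly, but that would not be a good approach. To learn the researcher's writing style, the model needs examples that pair a neutral version of a passage with the researcher's original wording, which is why we generated the neutral texts above.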
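
One way to turn neutral_texts.json into a training file is sketched below. It assumes the neutral text is the model input and the researcher's original chunk is the target, and it uses the field names "prompt" and "output" that the dataset declaration later in this article maps to. The instruction wording and the helper names are hypothetical, not taken from the original project.

```python
import json
from typing import Dict, List

# Hypothetical instruction; the wording is an assumption, not the article's prompt
INSTRUCTION = "Rewrite the following text in the researcher's writing style:\n"


def to_training_pairs(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map neutral text (input) to the researcher's original text (target)."""
    pairs = []
    for rec in records:
        neutral = rec["Neutral_text"].strip()
        raw = rec["Raw_text"].strip()
        if neutral and raw:  # drop pairs where either side is empty
            pairs.append({"prompt": INSTRUCTION + neutral, "output": raw})
    return pairs


def convert_file(src: str = "neutral_texts.json", dst: str = "dataset_name.json") -> int:
    """Read the neutral/raw records, write prompt/output pairs, return the pair count."""
    with open(src, "r", encoding="utf-8") as f:
        records = json.load(f)
    pairs = to_training_pairs(records)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(pairs, f, indent=4, ensure_ascii=False)
    return len(pairs)
```

Filtering out empty pairs matters here because a batch call to the LLM can occasionally return an empty completion, and such rows would teach the model nothing.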

Fine-Tuning the Language Model Using LLaMA Factory

LLaMA Factory offers a Gradio-based web UI that makes it easy to train a model on our own data. The configuration we used is described below, and the trained model can be tried through the Hugging Face and Streamlit links further down.

Before starting the training, a few files need small changes.

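interface.py

Open interface.py and locate the launch() call. Jump to its definition (right-click in most editors), which opens Gradio's blocks.py. There, change the default of the share parameter to True. With sharing enabled, Gradio creates a public link to the web UI, so the interface can be reached from outside the machine that runs the training. Passing share=True in the launch() call inside interface.py should have the same effect without editing the library file.

# blocks.py (Gradio): inside the launch() parameter list
def launch(
    share: bool = True,  # other parameters omitted
):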

constants.py

In constants.py, register the model we want to train, which in our case is TinyLlama:

"LLaMA-Tiny": {
    DownloadSource.DEFAULT: "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
},

dataset_info.json

We also need to declare our dataset to LLaMA Factory by adding an entry to dataset_info.json:

"identity": {
    "file_name": "dataset_name.json",
    "columns": {
        "prompt": "prompt",
        "response": "output"
    }
},
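
Before launching a training run, it can help to confirm that every record in the dataset file actually contains the fields named in the columns mapping. The check below is a small sketch of that idea; missing_columns is a hypothetical helper, not part of LLaMA Factory.

```python
from typing import Dict, List


def missing_columns(records: List[dict], columns: Dict[str, str]) -> List[int]:
    """Return the indices of records that lack a field named in the column mapping."""
    required = set(columns.values())  # here: {"prompt", "output"}
    return [i for i, rec in enumerate(records) if not required.issubset(rec.keys())]


columns = {"prompt": "prompt", "response": "output"}
records = [
    {"prompt": "neutral text", "output": "original text"},
    {"prompt": "neutral text only"},  # the "output" field is missing
]
print(missing_columns(records, columns))  # [1]
```

An empty result means every record has the declared fields.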

Hugging Face Interface: https://huggingface.co/MorTal007/Mimicker

StreamLit App: https://mimicker-abdulrehman-codes.streamlit.app/

Machine Learning Model Training Configuration

Stage

The stage to perform in training: Supervised Fine-Tuning

This setting specifies that the training process will be a supervised fine-tuning task.

Hyperparameters

Learning Rate
The initial learning rate for AdamW: 5e-5

The learning rate determines the step size at each iteration of the optimization algorithm. A smaller learning rate may lead to more stable convergence, but slower training.

Epochs
Total number of training epochs to perform: 3.0

An epoch is one complete pass through the entire training dataset. More epochs let the model fit the training data more closely, but too many can lead to overfitting.

Maximum Gradient Norm
Norm for gradient clipping: 1.0

Gradient clipping is a technique used to prevent exploding gradients during training. This setting specifies the maximum norm (magnitude) of the gradients before they are clipped.
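
As an illustration of clipping by norm, the sketch below rescales a flat list of gradient values so that their overall L2 norm does not exceed the limit. Training frameworks apply the same idea across all parameter tensors at once; the function name and the example values are hypothetical.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale grads so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


print(clip_by_global_norm([3.0, 4.0]))  # norm 5.0 is scaled down to norm 1.0
print(clip_by_global_norm([0.3, 0.4]))  # norm 0.5 is left as it is
```

Because every value is multiplied by the same factor, clipping changes the size of the update but not its direction.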

Max Samples
Maximum samples per dataset: 100000

This setting limits the number of samples (data points) used from the dataset during training, which can be useful for large datasets or when working with limited computational resources.

Compute Type
Whether to use mixed precision training: fp16

Mixed precision training uses lower precision (e.g., fp16) for certain computations to improve speed and memory efficiency, while keeping fp32 precision for the critical parts of the model.
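
The reason some parts stay in fp32 becomes clearer when looking at how coarse fp16 is. The sketch below uses Python's struct module, whose "e" format is IEEE 754 half precision, to round a few values to fp16; the helper name to_fp16 is hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(to_fp16(1.0001))  # 1.0: the spacing between fp16 values near 1 is about 0.001
print(to_fp16(1e-8))    # 0.0: too small to be represented in fp16
print(to_fp16(0.1))     # close to 0.1, but not exactly equal
```

Small weight updates can vanish in the same way, which is why mixed precision setups typically keep a higher-precision copy of the values that accumulate many tiny changes.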

Sequence Parameters
Cutoff length: Max tokens in the input sequence: 1024
Batch size: Number of samples processed on each GPU: 2
Gradient accumulation: Number of steps for gradient accumulation: 8
Val size: Proportion of data in the dev set: 0

These settings control how the input data is batched during training. The cutoff length caps the number of tokens per input sequence, the batch size is the number of samples each GPU processes at once, gradient accumulation sums gradients over several batches before each optimizer step, and val size sets the proportion of data held out for validation (0 here, so no validation split is used).
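
Batch size and gradient accumulation combine into an effective batch size, which in turn determines roughly how many optimizer steps a run takes. The arithmetic below is an approximation (frameworks may round slightly differently), and the dataset size of 1600 samples is a made-up figure for illustration only.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Number of samples that contribute to a single optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps(num_samples: int, epochs: float, batch: int) -> int:
    """Approximate number of optimizer steps over the whole run."""
    steps_per_epoch = math.ceil(num_samples / batch)
    return math.ceil(steps_per_epoch * epochs)


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=8)
print(batch)  # 16
print(optimizer_steps(num_samples=1600, epochs=3.0, batch=batch))  # 300
```

With the values from this article (batch size 2, accumulation 8, one GPU), each weight update is based on 16 samples even though only 2 are held in GPU memory at a time.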

Learning Rate Scheduler
Name of the learning rate scheduler: cosine

The learning rate scheduler adjusts the learning rate during training, often decreasing it over time. The cosine scheduler follows a cosine annealing schedule, which can help improve convergence and generalization.
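
A minimal sketch of a cosine schedule is shown below. It assumes no warmup phase and a decay all the way to zero, which may differ from the exact schedule the training framework applies; the total of 300 steps is a hypothetical value.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Cosine annealing from base_lr at step 0 down to 0 at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, the midpoint, and the end of training
for step in (0, 150, 300):
    print(step, cosine_lr(step, total_steps=300))
```

The rate starts at the configured 5e-5, reaches half of that at the midpoint, and falls to zero at the final step, with the slowest changes at the very beginning and the very end.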

Additional Options
Resize token embeddings: Resizes the tokenizer vocabulary and the embedding layers.
Pack sequences: Packs sequences into samples of fixed length.
Upcast LayerNorm: Upcasts the LayerNorm weights to float32.
Enable S^2 Attention: Uses the shift short attention mechanism proposed by LongLoRA.
Enable LLaMA Pro: Makes the parameters in the expanded blocks trainable.
Enable external logger: Uses TensorBoard or wandb to log the experiment.

These options cover resizing the token embeddings, packing sequences into fixed-length samples, keeping LayerNorm in higher precision, enabling shift short attention (S^2 Attention) for longer contexts, training only the expanded blocks as in LLaMA Pro, and logging to external tools such as TensorBoard or wandb.
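
Of these options, sequence packing is the easiest to picture in code. The sketch below concatenates tokenized samples, separated by an end-of-sequence token, and cuts the result into fixed-length blocks. It is a simplified illustration and not necessarily how LLaMA Factory implements packing; the function name and token ids are hypothetical.

```python
from typing import List


def pack_sequences(sequences: List[List[int]], block_size: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized samples (separated by eos_id) and cut fixed-length blocks."""
    stream: List[int] = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)  # mark the boundary between samples
    # Keep only full blocks; the leftover tail is dropped in this sketch
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


print(pack_sequences([[1, 2, 3], [4, 5], [6]], block_size=4, eos_id=0))
# [[1, 2, 3, 0], [4, 5, 0, 6]]
```

Packing avoids wasting computation on padding tokens when samples are much shorter than the cutoff length, at the cost of mixing several samples in one training sequence.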

This configuration interface provides a comprehensive set of hyperparameters and options for fine-tuning and customizing the training process of a machine-learning model for specific tasks and datasets.

Conclusion

The code in this article preprocesses the dataset, and the configuration walkthrough explains how the LLaMA-based model was trained with LLaMA Factory.

If you want to read more of my content or follow my work, you can connect with me on LinkedIn, Twitter, and Github.

Abdulrehman

CS Student and a Data Science and Machine Learning Enthusiast