Google Summer of Code 2024 Red Hen Lab Final evaluation

Multilingual News LLM Fine tuning

Tarun Jain
5 min read · Aug 26, 2024

This article is the final evaluation update on my Google Summer of Code 2024 journey as a contributor to Red Hen Lab.

About the project

Red Hen has access to an extensive news archive, which has been processed through speech and natural language processing pipelines during previous Google Summer of Code and collaborative efforts. They propose to utilise their rich television news data to train a Large Language Model (LLM) capable of answering questions about the world. Furthermore, their goal is to make this model accessible to a broad open-source audience. This conversational LLM, built on news data, can be seamlessly integrated with other services to develop automated bots.

Dataset Creation

The dataset creation process utilized the Self-Instruct framework, where we maintained consistent instructions but varied the question and answer pairs. During the second half of the GSoC coding phase, we generated dataset entries for the remaining 11 months of the year. Alongside the English dataset generation, I also completed the extraction of context and metadata for other European languages, including Spanish, French, German, and Portuguese. Once the context and metadata extraction was completed, question and answer pairs were generated for each language.
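
As a rough illustration of that setup, the sketch below shows how an individual training entry can be assembled: the instruction stays fixed while the question and answer vary per news segment. The record layout, file name, and the sample answer are placeholders, not the exact schema of the released datasets.

import json

# Fixed instruction reused across every entry (Self-Instruct style); only the
# question/answer pairs change from one news segment to the next.
INSTRUCTION = "Generate a concise and accurate news summary based on the following question."

def build_entry(question, answer):
    # Hypothetical record layout; the released datasets may use different field names.
    return {"instruction": INSTRUCTION, "input": question, "output": answer}

# Placeholder pairs standing in for the LLM-generated questions and answers.
qa_pairs = [
    ("What is the status of the evacuations and the condition of those injured?",
     "Placeholder answer drawn from the corresponding news transcript."),
]

with open("news_qa.jsonl", "w") as f:
    for question, answer in qa_pairs:
        f.write(json.dumps(build_entry(question, answer)) + "\n")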

We have made the beta dataset publicly available on HuggingFace.

  • Dataset-1:
  • Dataset-2:
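
Either dataset above can be pulled directly with the datasets library; the repository id below is an obvious placeholder, so substitute the actual dataset name from the links.

from datasets import load_dataset

# Placeholder repo id, not a real repository; use the dataset linked above.
ds = load_dataset("username/dataset-name")
print(ds["train"][0])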

Model Training

The Large Language Model was fine-tuned using Microsoft’s Phi-3-mini-4k-instruct model as the base, with additional experiments on Google’s Gemma-7b-it model and on raw text data directly. The training process combined Supervised Fine-Tuning (SFT) with Parameter-Efficient Fine-Tuning (PEFT) using a LoRA configuration, aiming to adapt the model’s behavior for a Retrieval Augmented Generation (RAG) pipeline. The model was loaded in 4-bit NF4 quantization to keep memory requirements manageable. Despite challenges such as heavy computational requirements and time-consuming experimental runs, with each session lasting 3–5 hours, the initial results using just one month’s data were promising, matching the performance of larger open-source models.
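
For readers who want a concrete picture of that setup, here is a minimal sketch of the SFT + LoRA recipe with 4-bit NF4 loading, written against the Hugging Face transformers/peft/trl stack. The hyperparameters, LoRA targets, dataset path, and text field are illustrative assumptions, not the exact training configuration used in the project.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base_model = "microsoft/Phi-3-mini-4k-instruct"

# Load the base model with 4-bit NF4 quantization so it fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, trust_remote_code=True, device_map="auto"
)

# LoRA adapters on the attention projections (illustrative rank and targets).
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Placeholder path to training records whose "text" field holds the instruction,
# question, and answer already rendered with the chat template.
dataset = load_dataset("json", data_files={"train": "news_qa.jsonl"})

training_args = SFTConfig(
    output_dir="news-reporter-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    max_seq_length=2048,
    dataset_text_field="text",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    peft_config=peft_config,
)
trainer.train()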

Building on this success, we expanded our training to encompass 11 months of data, significantly enhancing the model’s capabilities. Furthermore, we’re excited to announce the release of new models trained on datasets in five languages: English, Spanish, French, German, and Portuguese, along with a combined multilingual model.

This expansion into multiple languages marks a significant milestone in our project, enabling our model to function effectively across diverse linguistic contexts. The successful development of these specialized, reporter-like language models capable of operating in multiple languages represents a substantial advancement in multilingual AI for journalism and information retrieval.

  • Model-1:
  • Model-2:

Quantization

GGUF (GPT-Generated Unified Format) quantization is an advanced technique used to optimize large language models for efficient deployment and inference. This method reduces the model’s size and memory footprint while maintaining much of its performance. GGUF, an evolution of the GGML format, allows for more flexible and efficient model compression.

It works by converting the model’s parameters from high-precision floating-point numbers to lower-precision formats, such as 4-bit or 8-bit integers. This process significantly decreases the model’s size, enabling it to run on devices with limited resources or to load faster on more powerful systems.
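
As an example of what this enables, a GGUF export of the model can be run locally through llama.cpp bindings such as llama-cpp-python; the file name below is a placeholder for whichever quantized export is used.

from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder file name for a 4-bit GGUF export of the fine-tuned model.
llm = Llama(model_path="news-reporter-3b.Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the status of the evacuations and the condition of those injured?"}],
    max_tokens=256,
    temperature=0.1,
)
print(out["choices"][0]["message"]["content"])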

Inference

HuggingFace Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed

model_name = "RedHenLabs/news-reporter-3b"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

def test_inference(prompt):
    # Prepend the instruction used during fine-tuning, then apply the chat template.
    prefix = "Generate a concise and accurate news summary based on the following question.\n Input:"
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prefix + prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt, max_new_tokens=512, do_sample=True, num_beams=1,
        temperature=0.1, top_k=50, top_p=0.95, max_time=180,
    )
    # Return only the newly generated text, stripping the echoed prompt.
    return outputs[0]["generated_text"][len(prompt):].strip()

res = test_inference("What is the status of the evacuations and the condition of those injured?")
print(res)

Langchain Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed, pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate

model_name = "RedHenLabs/news-reporter-3b"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, trust_remote_code=True, device_map="auto"
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    return_full_text=False,
    do_sample=True,
)

# Wrap the transformers pipeline so it can be composed in a LangChain chain.
hf = HuggingFacePipeline(pipeline=pipe)

# The prompt follows the Phi-3 chat format the model was fine-tuned with.
template = """
<|user|>
Act as a news reporter and answer the user question
Input: {question}
<|end|>
<|assistant|>
"""
prompt = PromptTemplate.from_template(template)

query = "What is the most common side effect of taking Chantix?"
chain = prompt | hf
print(chain.invoke({"question": query}))

Benchmark

All of the source code for this Google Summer of Code project is pushed here:

Challenges faced for Google Summer of Code 2024

  • Finding the optimal hyperparameters and tracking the loss across each experiment was a significant challenge. During model implementation, multiple experiments were run to achieve the right results for the updated dataset. In the initial experiments, I encountered issues with extra tokens being generated.
  • Computing and training large language models is always challenging. Since multiple experiments were run, the training of the model typically took 3–5 hours, depending on the dataset and hyperparameter changes.
  • Generating question and answer pairs for the non-English languages was a time-consuming process. We generated around 20K+ pairs, which, combined with the English dataset, resulted in over 64K dataset pairs. This allowed us to build two models from two datasets (English only, and English+French+Spanish+German+Portuguese).

Final thoughts: Future work

We’re excited to announce that we’re in the process of writing a research paper based on our GSoC work. This paper will detail our journey in developing a multilingual language model, highlighting the challenges we overcame and the innovative approaches we used.

Looking ahead, I am keen to continue my involvement with this project. I am particularly interested in exploring the development of a Speech LLM (Large Language Model) to complement our text-based model. This new direction could open up exciting possibilities for voice-based interactions in the news and media industry. Additionally, I will be eager to join as a Mentor after building the new Speech LLM model.

Well, that’s the update for Google Summer of Code 2024. In every sense, this has been one amazing summer.

Written by Tarun Jain

Youtube: AIWithTarun || ML @AIPlanet || GSoC'24 RedHen Lab ||GSoC'23 @caMicroscope || GDE in ML