Spending my summer ’24 in GSoC with Red Hen Lab

Rohan Kumar Singh
11 min read · May 2, 2024


This blog is maintained by Rohan Kumar Singh to document progress on "Chatty AI", the GSoC 2024 project with Red Hen Lab.

Project Specifics

Title: ‘Chatty AI’

Mentors: Mark Turner, Arthur Lorenzi Almeida and Marcelo Viridiano

Abstract: Red Hen Lab proposes the development of an open-source chatbot specializing in Construction Grammar (CxG) and FrameNet (FN). This chatbot will serve as a comprehensive resource for those interested in these linguistic frameworks, offering explanations, directing users to relevant materials, and facilitating conversations.

Week 0: We discussed a wide range of topics, including the CWRU HPC, several transformer-based models, and FrameNet. It was suggested that I familiarize myself with SSH keys, remote HPC access via the console, and the frame XML structure. A sample parsed frame XML file (Abandonment) was analysed, and further refinements were outlined. At the conclusion of the meeting, we decided to use T5 for this task.

Coding period officially begins!!!

Week 1: After setting up SSH keys and successfully connecting to the CWRU HPC, I started off with some adjustments to the XML parser according to the mentors' specifications and tested it on a more complex frame like Commerce_goods-transfer. A sample prompt should not contain the whole list of lexical units, only the one being referred to in that particular frame. Constructing different types of prompts is also crucial for this task, e.g. asking the model to describe a frame based on its lexical units, or having the model guess the missing bits in a definition or in the frame elements.

Week 2: A better version of the XML parser was achieved, which transforms the whole XML file into a JSON file comprising all the information given in the XML file in a structured manner. In addition, it creates different prompts out of the information supplied by the XML file and saves them as a text file. This parser is available on the CWRU HPC Gallina server as a command-line pipeline under the directory /mnt/rds/redhen/gallina/home/rks110, where a sample XML file of the frame Exercising is also present for testing purposes. The pipeline is named xml_pipeline.py. In the terminal, go to the directory above and type source .venv/bin/activate to activate the virtual environment, then run python3 xml_pipeline.py Exercising.xml (or [XML file path]) to generate the JSON and text files for the frame Exercising; the virtual environment can be deactivated with the simple command deactivate. This pipeline is not perfect yet, but it is good enough for basic usage. Since the CWRU HPC is currently under quarterly maintenance until 13th June, it may not be accessible until the maintenance work is finished. The prompts saved in the text file can be categorised into 3 types (a rough sketch of the parsing step follows the list):

  1. Perform frame analysis on a given statement or sentence.
  2. Given a frame's lexical unit or its name, describe the frame in terms of its definition, frame elements and other frame relations with it.
  3. Guess the frame or missing bits on the basis of frame definition or frame elements.
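
For the curious, here is a rough, simplified sketch of what this parsing step can look like. It is not the actual pipeline code: the function names are illustrative, and it assumes the standard fndata-1.7 frame XML layout (tag and attribute names may need adjusting) plus lxml for BeautifulSoup's XML mode.

```python
# Simplified sketch of the XML-to-JSON-and-prompts step (illustrative only).
import json
import sys
from bs4 import BeautifulSoup

def parse_frame(xml_path):
    with open(xml_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "xml")  # requires lxml
    frame = soup.find("frame")
    return {
        "name": frame["name"],
        "definition": frame.find("definition").get_text(" ", strip=True),
        "frame_elements": [fe["name"] for fe in frame.find_all("FE")],
        "lexical_units": [lu["name"] for lu in frame.find_all("lexUnit")],
    }

def build_prompts(data):
    # Type 2: one prompt per lexical unit, mentioning only that unit.
    prompts = [
        f"Given the lexical unit '{lu}', describe the frame it evokes in "
        f"terms of its definition, frame elements and frame relations."
        for lu in data["lexical_units"]
    ]
    # Type 3: guess the frame from its definition.
    prompts.append(f"Guess the frame whose definition is: {data['definition']}")
    return prompts

if __name__ == "__main__":
    data = parse_frame(sys.argv[1])
    with open(f"{data['name']}.json", "w") as f:
        json.dump(data, f, indent=2)
    with open(f"{data['name']}_prompts.txt", "w") as f:
        f.write("\n".join(build_prompts(data)))
```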

These prompts are being created to fine-tune the T5 model so that it can grasp and learn about frames effectively. These 3 kinds of prompts alone are not enough, and more types will be added in the near future. All the preprocessing work was performed in this Kaggle notebook, as FrameNet v1.7 is publicly available as a Kaggle dataset, making it easy to handle and import even after losing the online active kernel.
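The fine-tuning itself happens in the Kaggle notebook mentioned above; purely as an illustration, training T5 on such prompt-response pairs with Hugging Face transformers looks roughly like the sketch below (the sample pair and hyperparameters are made up, not the project's actual settings).

```python
# Minimal sketch of fine-tuning T5 on (prompt, response) pairs.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

pairs = [  # illustrative example pair
    ("Describe the frame evoked by the lexical unit 'exercise.v'.",
     "The Exercising frame: an Agent engages in physical activity..."),
]

def collate(batch):
    prompts, targets = zip(*batch)
    enc = tokenizer(list(prompts), padding=True, truncation=True,
                    return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True,
                       return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
```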

Week 3: The CWRU HPC cluster became available for use following a 3-day maintenance period, after which I uploaded my XML pipeline to the server and started testing it with the different types of XML file paths that can be supplied to it. I found that the BeautifulSoup library was not pre-installed (ModuleNotFoundError: No module named 'bs4'); how to proceed will be discussed in the Zoom call, as communicated over mail. Two more kinds of prompts were generated (illustrative templates follow the list):

  1. Explain the whole frame, where the model has only the frame name as context.
  2. Determine the frame and the lexical unit used to evoke it by just analyzing the sentence.
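
To make these two new types concrete, they can be templated along the following lines; the exact wording used in the pipeline may differ.

```python
# Illustrative templates for the two new prompt types (wording is a sketch).
def explain_frame_prompt(frame_name: str) -> str:
    # Type 4: only the frame name is given as context.
    return f"Explain the frame '{frame_name}'."

def identify_frame_prompt(sentence: str) -> str:
    # Type 5: identify the frame and the lexical unit that evokes it.
    return ("Determine the frame and the lexical unit used to evoke it "
            f"in the following sentence: \"{sentence}\"")

print(explain_frame_prompt("Exercising"))
print(identify_frame_prompt("She works out at the gym every morning."))
```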

These two prompts were needed because the prototype model was trained on the former 3 types of prompts and lacked both an understanding of basic frame details and the ability to identify a frame on its own. With time, more types of prompts will have to be devised for a better conversational model based on FrameNet. Then this new prototype model was tested, and it performed better than the former model. Since the fulltext annotation work is still ongoing, this model isn't capable of annotating sentences yet; a pipeline for fulltext annotation is in the works. Our next task is to include the fulltext-annotated prompt and response pairs in our collection of conversational training data. Afterwards, all major journals and research papers on FrameNet and Construction Grammar have to be transformed into groups of related question-answer pairs. These pairs will be generated with the help of publicly available Large Language Models such as Gemini and GPT-3.5. The live link for the prototype model is made available here so that everyone can test it and report any valuable feedback.

Week 4: The fulltext folder contained a corpus of annotated text in the form of XML. A single sentence is annotated in various layers, such as part of speech, named entity recognition, the white space layer (word sense disambiguation), the FrameNet layer, and others (Other, Sent, Verb). The most significant layers for our purpose are FrameNet and named entity recognition. A parser was built to extract these useful layers and their information from the XML file; the whole procedure of accomplishing this can be found here. Two types of prompts were then made out of the extracted information (a rough sketch of the layer extraction follows the list):

  1. The model is supposed to annotate the whole sentence on the basis of frame semantics.
  2. The model has to perform Named Entity Recognition on the given sentence.
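
As a rough illustration, pulling the FrameNet and NER layers out of a fulltext XML file can look like the sketch below. This is not the actual pipeline code; it assumes the fndata-1.7 fulltext layout, so attribute names may need adjusting.

```python
# Simplified sketch of extracting the FrameNet and NER layers (illustrative).
import sys
from bs4 import BeautifulSoup

def extract_layers(xml_path):
    with open(xml_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "xml")  # requires lxml
    sentences = []
    for sent in soup.find_all("sentence"):
        text = sent.find("text").get_text()
        record = {"text": text, "frames": [], "entities": []}
        for aset in sent.find_all("annotationSet"):
            # Frame-evoking annotation sets carry a frameName attribute.
            if aset.get("frameName"):
                fes = [
                    (lbl["name"], text[int(lbl["start"]): int(lbl["end"]) + 1])
                    for layer in aset.find_all("layer", {"name": "FE"})
                    for lbl in layer.find_all("label")
                    if lbl.get("start") is not None  # skip null-instantiated FEs
                ]
                record["frames"].append({"frame": aset["frameName"], "FEs": fes})
            # Named entities live in the sentence's NER layer.
            for layer in aset.find_all("layer", {"name": "NER"}):
                for lbl in layer.find_all("label"):
                    record["entities"].append(
                        (lbl["name"], text[int(lbl["start"]): int(lbl["end"]) + 1])
                    )
        sentences.append(record)
    return sentences

if __name__ == "__main__":
    for rec in extract_layers(sys.argv[1])[:3]:
        print(rec)
```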

This whole procedure is available on the CWRU HPC Gallina server as a command-line pipeline under the directory /mnt/rds/redhen/gallina/home/rks110, where a sample XML file ANC__110CYL068.xml is also present for testing purposes. The pipeline is named fulltext_anno_pipeline.py. In the terminal, go to the directory above and type source .venv/bin/activate to activate the virtual environment, then run python3 fulltext_anno_pipeline.py ANC__110CYL068.xml (or [annotated XML file path]) to process the supplied annotated file; the virtual environment can be deactivated with the simple command deactivate. A parsed JSON file and two text files containing the output of the pipeline will be saved. An additional prompt-response pair about generating a frame might be useful in the future. Afterwards, these new prompts were integrated into the prototype model, which is publicly available and can be found here.

A sample from the live prototype model

Week 5: The research papers, 27 in total, are mainly about FrameNet and Construction Grammar. I used the Gemini-pro API to generate question-answer pairs out of each research paper. Since the research papers contained massive amounts of text exceeding the word limit, they had to be broken down into chunks of 2000 words, with an additional overlapping context of 200 words from the previous chunk to maintain continuity (a sketch of this chunking step appears after the list below). Some of the papers were also in languages other than English. So the prompt given to Gemini was "chunk_text … create atleast 15 pairs of question answer out the text mentioned in english language separated by newline." The response generated was plain text, but the problem was that there was no specific format in which the question-answer pairs were listed, so a rigorous amount of work went into preprocessing the generated responses to fetch the list of question-answer pairs. Not all of the pairs could be extracted, but most of them have been collected, and work is still going on. The prototype model was noticed to be brittle, in the sense that when a question was asked in a different fashion, the model would output a nonsensical response. So I went on to learn about instruction-tuned LLMs; I found this article by IBM explaining various aspects of them. T5, the prototype model, is not instruction-tuned, whereas Flan-T5 (a later version of T5) is, meaning it is fine-tuned to follow instructions. There are several strategies to tackle this problem of handling variations in how questions can be asked:

  1. Replace some words in prompts with synonyms while keeping the meaning intact. This exposes the LLM to different phrasings without significantly changing the question.
  2. Rephrasing prompts while maintaining the core meaning. This can be done manually or with tools like paraphrasing APIs.
  3. Crafting prompts that instruct the LLM on how to handle variations in phrasing. For example, a prompt like “Answer the following question in a comprehensive way, even if it’s phrased differently: ” followed by your actual question can help guide the LLM.
  4. Include a few examples of different phrasings for the same question in prompts. This can give the LLM a reference point for handling variations.
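
Returning to the chunking step mentioned at the start of this week's update, a minimal sketch of the 2000-word chunks with a 200-word overlap could look like this; the Gemini call is elided, and the input file name is hypothetical.

```python
# Sketch of word-level chunking with overlapping context (illustrative).
def chunk_words(text, chunk_size=2000, overlap=200):
    """Split text into chunks of `chunk_size` words, each starting with the
    last `overlap` words of the previous chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

paper_text = open("paper.txt", encoding="utf-8").read()  # hypothetical input
for chunk in chunk_words(paper_text):
    prompt = (chunk + " ... create atleast 15 pairs of question answer out "
              "the text mentioned in english language separated by newline.")
    # response = gemini_model.generate_content(prompt)  # Gemini-pro API call
```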

Other LLMs being considered for fine-tuning are Llama 3 and, especially, Phi-3. The end of June 2024 witnessed the release of a newer version of Phi-3 (3.8B mini-instruct) by Microsoft, which gave impressive results on the Open LLM Leaderboard, comparable to Llama 3 70B. The main problem with fine-tuning these LLMs is the poor outcome of LoRA (a rough sketch of a typical LoRA setup is below). Benchmarking all these LLMs with RAG will decide the model for final fine-tuning on Case HPC.
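For context, attaching LoRA adapters with Hugging Face's PEFT library typically looks like the sketch below; the target modules and hyperparameters are illustrative, not the project's actual settings, and the gated model requires Hugging Face access approval.

```python
# Minimal LoRA setup with PEFT (illustrative hyperparameters).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct"  # gated: needs approved access
)
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```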

Week 6: This week started off with the GNU Screen utility. It is a terminal multiplexer used to multiplex several virtual consoles, allowing a user to access multiple separate login sessions inside a single terminal window, or to detach and reattach sessions even after disconnection. Then I began working towards running the Llama 3 8B Instruct model locally on Case HPC. Meta AI's Llama models are restricted or gated models which require explicit permission to acquire. The transformers and accelerate libraries from Hugging Face are prerequisites for this. Run git clone https://github.com/meta-llama/llama3.git, then execute the download.sh script inside the llama3 directory, passing the URL provided during acquisition when prompted, to start the download. Run TRANSFORM=`python -c "import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')"`, then pip install protobuf && python $TRANSFORM --input_dir ./<model_dir> --model_size 8B --output_dir ./<hf_model_dir>. Hugging Face-compatible weights will be written into <hf_model_dir>, and the model can then be loaded locally as a Hugging Face LLM. Now, I started running it locally on Case HPC and began zero-shot prompting with questions like "What are the data driven approaches to Framenet expansion?" and "Describe the frame 'Execute_plan' in terms of its definition, frame elements and other frame relations with it." The responses were not up to the mark, especially in the case of the second prompt, where the model hallucinated. Then I leveraged Retrieval-Augmented Generation (RAG) using LlamaIndex, where the documents for embedding included the research papers located in /mnt/rds/redhen/gallina/projects/ChattyAI/FrameConstructions and the parsed JSON frame files located in /mnt/rds/redhen/gallina/projects/ChattyAI/FramesConstructions/fndata-1.7/frame. The results were far better than the previous responses (a simplified sketch of the setup is below).
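The sketch below uses LlamaIndex v0.10-style imports (adjust for your version); the embedding model choice is illustrative rather than the exact configuration used on Case HPC.

```python
# Sketch of the LlamaIndex RAG setup (illustrative configuration).
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
Settings.llm = HuggingFaceLLM(model_name=model_id, tokenizer_name=model_id)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Point the reader at the papers (the frame JSON files can be added similarly).
documents = SimpleDirectoryReader(
    "/mnt/rds/redhen/gallina/projects/ChattyAI/FrameConstructions"
).load_data()

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=4)

print(query_engine.query(
    "What are the data driven approaches to Framenet expansion?"
))
```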

Prompt 1: What are the data driven approaches to Framenet expansion?

Response 1 (RAG)

Prompt 2: Read the frame ‘Execute_plan’ in the provided FrameNet data, then describe the frame in terms of its definition, frame elements and other frame relations with it.

Response 2 (RAG)

The above response is consistent with all 14 frame elements of "Execute_plan" in the FrameNet data, as shown below:

Prompt 3: Please read the frame ‘Execute_plan’, then generate an example based on the frame. Relate all the sentence parts with the frame elements.

Response 3 (RAG)

Prompt 4: Please propose 10 additional different lexical units that evoke the “Execute_plan” semantic frame.

Response 4 (RAG)

In the above response 4, the bolded **Enact** in the model's output simply means "Enact".

Prompt 5: Please propose 10 examples on “Execute_plan” semantic frame.

Response 5 (RAG)

Prompt 6: Annotate this sentence on the basis of frame semantics: “Health authorities say they have put measures into effect in all ports of entry and in centers of provinces.”

Response 6 (RAG)

The above response 6 is consistent with the FrameNet Data as shown below.

Now, we have to establish some kind of evaluation benchmark for LLMs specialising in Construction Grammar and FrameNet, so that we can judge models on a metric and rank them accordingly. This will indicate what works for a model and what does not, so that our next move can be planned. Building a modular LLM is also crucial for evaluation and comparison.

Week 7: I started off with some changes to the FrameNet XML parser. The changes included a different naming convention for the keys and the addition of the frame URL to the JSON. The refined JSON of the frame 'Accompaniment' is shown below.

illustration of refined JSON of frame ‘Accompaniment’

This will help the LLM better extract meaning from the context provided via RAG. Then I proceeded towards making the LLM (with RAG) able to conduct multi-turn conversations. First, I started with the chat engine provided by LlamaIndex. A chat engine is a high-level interface for having a conversation with your data (multiple back-and-forth exchanges instead of a single question and answer); think ChatGPT, but augmented with your knowledge base. But the results were very poor; it seemed like there was no history of messages. After this, I began implementing my own multi-turn conversation using the special tokens used at the time of pretraining Llama 3. Every model has its own set of tokens, which are generally unique to it. The tokens and prompt formats for Llama 3 can be found here. Since I am using the Instruct version of Llama 3 (better at following instructions), the following is the proposed system prompt and query wrapper prompt.

Prompt Format of Llama 3 Instruct
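
As a rough sketch (not the exact wrapper used in the project), a multi-turn Llama 3 Instruct prompt can be assembled by hand from these special tokens, following Meta's documented prompt format:

```python
# Sketch of building a multi-turn Llama 3 Instruct prompt by hand.
def build_llama3_prompt(system, history, user_msg):
    """`history` is a list of (user, assistant) turns."""
    parts = [
        "<|begin_of_text|>",
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>",
    ]
    for user, assistant in history:
        parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>")
        parts.append(f"<|start_header_id|>assistant<|end_header_id|>\n\n{assistant}<|eot_id|>")
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_msg}<|eot_id|>")
    # Leave the assistant header open so the model generates the next reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama3_prompt(
    system="You are Chatty AI, an assistant for FrameNet and Construction Grammar.",
    history=[("What is a semantic frame?", "A semantic frame is ...")],
    user_msg="Describe the frame 'Execute_plan'.",
)
```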

Afterwards, I uploaded the whole project to my GitHub account as a repository named ChattyAI, to have a remote backup and to keep track of the commits to the project. Next, I will implement the RankRAG framework and try to construct an evaluation benchmark for the models. RankRAG introduces a new instruction fine-tuning framework to perform effective context ranking and answer generation, enhancing an LLM's RAG capabilities. It leverages a small ranking dataset to outperform existing expert ranking models, and the authors show that Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 on nine knowledge-intensive benchmarks.

Weekly updates are made to this blog as the project progresses.
