GSoC 2024 Final Report

Yufei
Aug 23, 2024


This is the final-submission report for GSoC 2024. This year, I participated in Red Hen Lab.

Here is the GitHub repo link. Here is the project link.

Project Goal

  • Develop a multilingual large language model based on real-world news transcriptions.
  • Develop a RAG pipeline on unseen data, and compare the RAG performance of an off-the-shelf model against our model.
  • Inject guardrails for data security.

Among them, the RAG pipeline and the guardrails were my tasks.

What I did

  1. Explored the feasibility of audio RAG. I tried to embed audio using an audio model and compute the similarity between the audio embedding and the text embedding (with or without an MLP transfer layer). The result: it could not retrieve the correct context, so this task is on hold.
  2. Explored multimodal RAG. Read this article.
  3. Developed a RAG pipeline using LangChain.
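The retrieval check in step 1 boils down to a nearest-neighbour search over embedding vectors. Here is a minimal sketch in plain NumPy (the actual audio and text encoder models are not shown, and `retrieve` is an illustrative name, not code from the project):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb: np.ndarray, context_embs: list[np.ndarray]) -> int:
    """Return the index of the context embedding most similar to the query.
    In the experiment, query_emb came from an audio model and context_embs
    from a text model, optionally passed through an MLP transfer layer."""
    scores = [cosine_similarity(query_emb, c) for c in context_embs]
    return int(np.argmax(scores))
```

If the audio and text embedding spaces are not aligned, the argmax simply lands on the wrong context, which is what the experiment observed.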

4. Evaluated the RAG pipeline with these metrics:

context_recall: the extent to which the retrieved context aligns with the ground truth. detail

context_precision: evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. detail

answer_correctness: a higher score indicates a closer alignment between the generated answer and the ground truth. detail

answer_relevance: a lower score is assigned to answers that are incomplete or contain redundant information; computed using the question, the context, and the answer. detail
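As a toy illustration of what context_recall measures, the sketch below computes the fraction of ground-truth sentences that can be found in the retrieved context. This is not the actual RAGAs implementation (RAGAs uses an LLM to attribute claims to the context); it only conveys the intuition behind the metric:

```python
def toy_context_recall(ground_truth_sentences: list[str],
                       retrieved_context: str) -> float:
    """Fraction of ground-truth sentences that appear (lower-cased,
    verbatim) in the retrieved context. A toy stand-in for the
    LLM-based attribution RAGAs actually performs."""
    if not ground_truth_sentences:
        return 0.0
    context = retrieved_context.lower()
    hits = sum(1 for s in ground_truth_sentences if s.lower() in context)
    return hits / len(ground_truth_sentences)
```

A recall of 1.0 means every ground-truth statement is supported by the retrieved context; lower values indicate the retriever missed relevant material.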

5. Designed data-security guardrails: a guardrail is a protective technique that checks the input, output, and context before responding.
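The guardrail idea can be sketched as a pre-response check: if the retrieval score for the context falls below a threshold, refuse instead of answering. This is a minimal sketch, not the project's actual implementation; `answer_fn` is a placeholder for whatever LLM call produces the answer:

```python
REFUSAL = "I don't have enough reliable context to answer that."

def guarded_answer(question: str, context_score: float,
                   answer_fn, threshold: float = 0.5) -> str:
    """Refuse when the context score is below the threshold (0.5 is the
    default mentioned in the report); otherwise delegate to the LLM."""
    if context_score < threshold:
        return REFUSAL
    return answer_fn(question)
```

For example, `guarded_answer("Who said X?", 0.3, llm_call)` returns the refusal string without ever invoking the model, which keeps low-confidence contexts from leaking into answers.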

Project current state

RAG eval results: https://docs.google.com/spreadsheets/d/1qowvVmzlXf1syaTUDBfOdpeU_ZvenPNSLmbO1x0nIIQ/edit?usp=sharing

The guardrails are finished.

Further work

  • Higher context recall: try BM25 + embedding similarity, combining them with an ensemble retriever
  • More languages
  • Multi-modal
  • Create RedHenLabs/news-reporter-3b Endpoint
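One common way to combine a BM25 retriever with an embedding retriever (the first bullet above) is reciprocal rank fusion; LangChain's EnsembleRetriever implements a weighted variant of this idea. A hand-rolled sketch of the unweighted version:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids: each document scores
    sum(1 / (k + rank)) over the lists it appears in, so documents
    ranked highly by multiple retrievers rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Passing the BM25 ranking and the embedding-similarity ranking as the two inner lists yields a single fused ordering, from which the top few documents can be taken as the final context.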

Challenges

  • Low context recall & the model did not answer the question: changed the base model, added metadata, and set chunk_size to 756 & chunk_overlap to 150
  • High context recall but low answer relevance when using the ensemble retriever: the ensemble retriever returns reranked contexts from every sub-retriever, so it is reasonable to select only the first few (to be fair, four) contexts.
  • How to ensure data security using guardrails? Set up a guardrail that instructs the LLM to refuse to respond when the context recall is below a certain threshold (default is 0.5). (still trying…)
  • Output echoed the prompt: set the param `return_full_text = False`
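The chunking fix above (chunk_size 756, chunk_overlap 150) can be sketched as a simple character-level sliding window; LangChain's text splitters additionally try to break on separators such as newlines, but the overlap mechanics are the same:

```python
def split_text(text: str, chunk_size: int = 756,
               chunk_overlap: int = 150) -> list[str]:
    """Character-level sliding window: each chunk overlaps the previous
    one by chunk_overlap characters, so a sentence cut at a chunk
    boundary still appears whole in the neighbouring chunk."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

For instance, `split_text("abcdef", chunk_size=4, chunk_overlap=2)` yields `["abcd", "cdef"]`: the two-character overlap keeps "cd" in both chunks.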

Gains

  • A great experience, great collaborators, and exposure to a different culture.
  • Use of LangChain & LCEL
  • How to preprocess data for RAG
  • Use of RAGAs, and the ability to learn from official docs
  • Use of the OpenAI API, Google Colab, and the gdown tool
  • Essay-writing and blog-writing skills, and how to keep blogs clean
  • Improved skills with the git tool
  • Developed the habit of recording project progress
  • Communicate promptly and frequently.
  • Push code and back up everything promptly.

In conclusion

I faced many challenges during this project, including network issues, server disconnections, time constraints, and perplexing bugs. Fortunately, with perseverance and occasional help from my team members, I was able to overcome them all. This project has been incredibly valuable, and I want to emphasize: don’t delay experiments out of fear of encountering exceptions or failures. Ensure you have a backup in place, and then dive in. Don’t be afraid of facing challenges; communicate with your teammates, and you’ll always find a solution.

Thanks

  • Karan Singla, who always helped me and gave me support.
  • Sridhar Vanga, who also listened to my report every week.
  • Tarun, who helped me a lot with debugging, trained our model, and provided context and metadata.
  • CWRU, the cluster provider.
