GSoC Weekly Update: An Attempt to Integrate with Google Gemma

Qixiang Wang
Jun 25, 2024


GGUF-style model issues

At the beginning of this week’s work, I planned to use the previously trained and quantized Llama-3 GGUF model for subsequent tasks. However, during development I encountered the following issues:

  1. When I previously tested the model, I downloaded the GGUF file to my local machine and loaded it into a chatbot framework such as GPT4All or LM Studio, which are designed for running local GGUF models interactively. However, I had never run this GGUF model in a server-like environment.
  2. To use the GGUF model in Google Colab, I loaded it with llama-cpp-python, the Python binding for llama.cpp. This process was very time-consuming and resource-intensive, and the actual response speed was unsatisfactory.
  3. Due to limited GPU computing power, the GGUF model repeatedly produced garbled and repetitive output.
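The repetitive output described in issue 3 can be flagged programmatically before a reply reaches the user. The snippet below is a minimal sketch of one such check; the n-gram size and threshold are illustrative values I chose, not settings from the actual project.

```python
def is_repetitive(text: str, n: int = 3, threshold: float = 0.5) -> bool:
    """Flag text whose n-grams repeat heavily, a common failure mode
    of small quantized models running under limited GPU compute."""
    words = text.split()
    if len(words) < n:
        return False
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Share of duplicated n-grams; a high value indicates looping output.
    repeat_ratio = 1 - len(set(ngrams)) / len(ngrams)
    return repeat_ratio > threshold

print(is_repetitive("the cat sat on the mat"))                       # False
print(is_repetitive("I am happy I am happy I am happy I am happy"))  # True
```

A check like this could be used to trigger a retry with different sampling parameters instead of showing a broken reply.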

Resolution of the above issues

After analyzing the above issues, I concluded that continuing with the GGUF model was not a good choice. Instead, I opted for a merged 4-bit model trained with the Unsloth library as the model for actual development. The usage of the Unsloth library is detailed below.
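Loading the merged 4-bit model can be sketched roughly as below. This assumes Unsloth's `FastLanguageModel.from_pretrained` API and uses the model ID from my Hugging Face upload; the sequence length is an illustrative value, not necessarily what the project uses.

```python
MODEL_ID = "Antonio27/llama3-8b-4-bit-for-sugar"

def load_kwargs(max_seq_length: int = 2048) -> dict:
    """Arguments passed to FastLanguageModel.from_pretrained."""
    return {
        "model_name": MODEL_ID,
        "max_seq_length": max_seq_length,
        "load_in_4bit": True,  # merged 4-bit weights fit on a free Colab GPU
    }

def load_model():
    # Imported inside the function so the sketch stays importable
    # on machines without a GPU or the unsloth package.
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(**load_kwargs())
    FastLanguageModel.for_inference(model)  # switch to faster inference mode
    return model, tokenizer
```

Using the merged 4-bit checkpoint directly avoids the slow GGUF decoding step from issue 2 above.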

Unsloth GitHub repo: https://github.com/unslothai/unsloth

Finetuning code: https://github.com/XXXJumpingFrogXXX/Chatbot-for-Chat-Activity/blob/main/Llama3_Finetuning.ipynb

Previous Blog:

I have also uploaded the latest fine-tuned model to Hugging Face:

https://huggingface.co/Antonio27/llama3-8b-4-bit-for-sugar

Gemma-combined workflow

In the current workflow, I have incorporated Google Gemma 7B, another open-source model that performs reasonably well on text-processing tasks. The workflow is designed as follows: the fine-tuned Llama-3 8B model generates the initial response to the user’s query; that response is then fed into Gemma 7B, which adjusts it so that the final content is more suitable for children and free of garbled or repetitive text.
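The two-stage workflow can be sketched as a simple function. The draft and refine callables stand in for the two models, and the rewrite prompt wording is my own illustrative assumption, not the exact prompt used in the project.

```python
def two_stage_reply(query: str, draft_fn, refine_fn) -> str:
    """Stage 1: the fine-tuned Llama-3 8B drafts an answer.
    Stage 2: Gemma 7B rewrites the draft to be child-friendly and
    to strip garbled or repeated text."""
    draft = draft_fn(query)
    rewrite_prompt = (
        "Rewrite the answer below so a child can easily understand it. "
        "Remove any garbled or repeated text.\n"
        f"Question: {query}\n"
        f"Draft answer: {draft}"
    )
    return refine_fn(rewrite_prompt)

# Stub callables to show the data flow without loading either model.
draft = lambda q: f"[llama3 draft for: {q}]"
refine = lambda p: f"[gemma rewrite of: {p.splitlines()[-1]}]"
print(two_stage_reply("What is gravity?", draft, refine))
```

Keeping the two stages as plain callables makes it easy to swap either model, or to skip the Gemma pass entirely, without touching the rest of the pipeline.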

Gemma-combined Workflow

The code implementing this workflow runs correctly. The Colab notebook is available here: https://colab.research.google.com/drive/1ZBoGQPipPD3owetIOBJEDMqzsBriLnMc?usp=sharing

Result conclusion

I have created a new Google spreadsheet that stores five questions about complex concepts and their corresponding responses. The latest responses show some improvement, though not a significant one; occasionally the generated replies still contain complex terms. These remaining issues stem mainly from the small model size and limited computing power, and cannot be resolved immediately at this stage. So this week I will focus on becoming familiar with the Chat Activity code and on the UI/UX design.
