Naturally Learn and Progress App
This article was produced as part of the final project for Harvard’s AC215 Fall 2023 course.
Authors: Alice Saparov, George Cruz, John Wu, Scotty Smith, Tarek Aloui
Table of Contents
- Introduction
- Getting the Data
- API
- Modeling
- Frontend
- Deployment
- Challenges
- Next Steps
- References
Introduction
The objective of this project is to bridge the tutoring gap in specialized learning, making it more accessible and customizable for each individual learner. We aim to mitigate this challenge by leveraging large language models (LLMs), offering a robust digital platform to complement the irreplaceable guidance of human tutors. We accomplish this by:
- Developing a data processor that scrapes textbook data for information and review questions
- Training and fine-tuning LLMs to be subject-specialized and able to produce Q&A, summaries, and term definitions
- Creating a front-end for a streamlined user experience, providing a “tutor” for any user and any desired subject
- Generating an API to ensure seamless interactions between the back-end and front-end.
We are excited to expand on all of these components of our app below, and discuss our thoughts and challenges throughout the semester.
Getting the Data
In order to enhance the capabilities of our large language model (LLM), we built a comprehensive data pipeline using Vertex AI. The initial step involved the data collector component, which scrapes data from the internet, focusing on a variety of book subjects. This data is then fed into the data processor component, which is responsible for parsing, cleaning, and structuring the data, making it more conducive to the LLM’s training. The parser reads files (primarily JSON), extracts the relevant text, and performs the cleaning necessary to ensure the quality and relevance of the data. Methods like walk_and_read_files, extract_text, and clean_extracted_text reflect a thorough approach to handling the data, ensuring that only the most pertinent and clean information is retained. Additionally, the parser class includes functionality to map subjects to their respective textbooks, organizing the data efficiently and making it more accessible for the model.
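To make the parsing step more concrete, here is a minimal sketch of how such a processor might look. The method names come from our component, but the class layout, the assumed "chapters"/"text" JSON schema, and the cleaning rules are illustrative rather than our exact implementation.

```python
import json
import os
import re


class DataProcessor:
    """Illustrative parser: walks a directory of scraped textbook files,
    extracts the raw text, and cleans it for downstream training."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        self.subject_to_books: dict[str, list[str]] = {}

    def walk_and_read_files(self) -> list[dict]:
        """Collect every JSON file under data_dir, keyed by its subject folder."""
        records = []
        for root, _, files in os.walk(self.data_dir):
            for name in files:
                if not name.endswith(".json"):
                    continue
                subject = os.path.basename(root)
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    payload = json.load(f)
                self.subject_to_books.setdefault(subject, []).append(name)
                records.append({"subject": subject, "raw": payload})
        return records

    def extract_text(self, payload: dict) -> str:
        """Pull the relevant text fields out of one parsed JSON document."""
        # Assumes a 'chapters' -> 'text' layout; the real schema may differ.
        chapters = payload.get("chapters", [])
        return "\n".join(ch.get("text", "") for ch in chapters)

    def clean_extracted_text(self, text: str) -> str:
        """Strip leftover markup and collapse whitespace."""
        text = re.sub(r"<[^>]+>", " ", text)      # drop stray HTML tags
        text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
        return text
```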
We added a synthesizer component to perform several functions: loading the environment, initializing the Google Cloud Storage (GCS) client, downloading data, aggregating prompts, labeling data using GPT-3.5 Turbo, and finally uploading the data. These steps are meticulously designed to handle the data efficiently, ensuring that it is relevant, well organized, and ready for the subsequent processing stages. Each function within the synthesizer, from load_environment to upload_data, plays a specific role in transforming raw data into a structured format suitable for training the LLM.
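The sketch below shows how these stages might fit together. The environment variable name, bucket layout, labeling prompt, and the pre-1.0 openai client call are assumptions for illustration, not our exact synthesizer code.

```python
import os

import openai
from google.cloud import storage


def load_environment() -> None:
    """Read credentials from the environment (variable name is illustrative)."""
    openai.api_key = os.environ["OPENAI_API_KEY"]


def download_data(bucket_name: str, prefix: str, dest_dir: str) -> list[str]:
    """Pull raw text blobs from GCS into a local working directory."""
    client = storage.Client()
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        local_path = os.path.join(dest_dir, os.path.basename(blob.name))
        blob.download_to_filename(local_path)
        paths.append(local_path)
    return paths


def label_data(passage: str) -> str:
    """Ask GPT-3.5 Turbo to generate a Q&A-style label for one passage."""
    response = openai.ChatCompletion.create(  # pre-1.0 openai client API
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Write one review question and its answer for the passage."},
            {"role": "user", "content": passage},
        ],
    )
    return response["choices"][0]["message"]["content"]


def upload_data(bucket_name: str, local_path: str, dest_blob: str) -> None:
    """Push the labeled dataset back to GCS for the training stage."""
    client = storage.Client()
    client.bucket(bucket_name).blob(dest_blob).upload_from_filename(local_path)
```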
Additionally, we used QLoRA fine-tuning to reduce the number of trainable parameters and quantized the base model to 4-bit precision, as described in the paper QLoRA: Efficient Finetuning of Quantized LLMs (Fig. 1). QLoRA models run inference through a base model plus an adapter, where training is performed only on the adapter. Our overall efforts did not yield promising results, as the base models still gave us better outputs. Our fine-tuned adapters are all available on the Hugging Face Hub as PEFT models: yuw38/falcon-7b-term, yuw385/falcon-7b-small, and yuw385/falcon-7b-med.
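A minimal sketch of this setup with Hugging Face Transformers, bitsandbytes, and PEFT is shown below; the LoRA rank, alpha, and dropout values are illustrative choices, not the exact hyperparameters we used.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE_MODEL = "tiiuae/falcon-7b-instruct"

# 4-bit NF4 quantization of the frozen base model, per the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only the low-rank adapter weights are trained; the 4-bit base stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The resulting adapter can then be trained with the standard Hugging Face Trainer and pushed to the Hub, which is how the PEFT adapters listed above are distributed.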
API
For our API implementation we used the popular Python package FastAPI. This enabled us to connect our React frontend to our Python backend in a straightforward, easy-to-understand manner. We leveraged the REST API protocol, as it allows for streamlined debugging and testing of backend APIs using command-line tools like curl.
We implemented a single type of API call, POST, which is typically used for file upload and processing. Once the file is uploaded to the backend, it takes some time to process it through our model before the output can be rendered. To keep the user experience simple, we rely on asynchronous requests in the React frontend so that the uploaded file can be processed and the output displayed seamlessly as it is returned by the model running in the backend. We hosted the backend with the Uvicorn server.
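A minimal sketch of this server setup is shown below; the allowed frontend origin and port are placeholders rather than our deployed values.

```python
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow the React frontend (running on a different origin) to call the backend.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # placeholder frontend origin
    allow_methods=["*"],
    allow_headers=["*"],
)

if __name__ == "__main__":
    # Serve the API with Uvicorn.
    uvicorn.run(app, host="0.0.0.0", port=9000)
```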
The routines we implemented were:
- Convenience API call for UI
This was a combined endpoint called /uploadfile/ to handle the file upload and make API requests for our three LLM tasks of summarization, term extraction, and Q&A generation. This API assumes the external endpoints return JSON responses and creates a final response based on the processed results.
- Task-specific API calls
For users interested in interacting with our model via command-line tools, there are three task-specific API calls available. Our API features endpoints for our individual NLP tasks: /uploadfile_qa/, /uploadfile_sum/, and /uploadfile_terms/. All of these endpoints receive a file, a subject, and user input through form data, prepare a payload for API requests, and interact with the model API endpoints for the respective tasks of generating Q&A, summarizing, and extracting key terms. The responses from these API calls are then processed using our format_response function to return a final constructed response.
More specifically…
/uploadfile_qa/
Executes the model to generate questions and answers based on the file uploaded and subject selected by the user. It returns a formatted (JSON) response with the Q&A to the user.
/uploadfile_sum/
Executes the model to generate a summary based on the file uploaded and subject selected by the user. It returns a formatted (JSON) response with the summary to the user.
/uploadfile_terms/
Executes the model to generate key terms related to the file uploaded and subject selected by the user. It returns a formatted (JSON) response with these terms to the user.
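To illustrate the shape of these endpoints, here is a sketch of the Q&A route. The model-service URL, the payload fields, and the body of format_response are assumptions for illustration; only the endpoint path, the form-data inputs, and the overall flow follow the description above.

```python
import requests
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()  # or reuse the app object from the earlier sketch

MODEL_ENDPOINTS = {"qa": "http://model-service/qa"}  # placeholder model endpoint URL


def format_response(task: str, result: dict) -> dict:
    """Wrap the model output in the JSON shape the frontend expects (illustrative)."""
    return {"task": task, "output": result}


@app.post("/uploadfile_qa/")
async def uploadfile_qa(
    file: UploadFile = File(...),
    subject: str = Form(...),
    user_context: str = Form(...),
):
    # Read the uploaded file and build a payload for the model endpoint.
    contents = (await file.read()).decode("utf-8")
    payload = {"subject": subject, "context": user_context, "text": contents}

    # Forward the request to the Q&A model endpoint and format its JSON reply.
    model_reply = requests.post(MODEL_ENDPOINTS["qa"], json=payload, timeout=300)
    return format_response("qa", model_reply.json())
```

From the command line, the same route can be exercised with curl by sending the file, subject, and user text as multipart form fields.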
Modeling
The core of our project revolved around training the LLM to perform various tasks, such as text categorization, summarization, and question answering. This was achieved through the development of specialized classes and methods, tailored to meet the unique demands of each task. By employing techniques like prompt engineering and fine-tuning, the goal was to ensure that the model could generate accurate and contextually relevant responses.
Our decision to select Falcon-40B was primarily influenced by its impressive performance on key benchmarks relevant to our application: HellaSwag, which is critical for assessing context awareness in language models, and TruthfulQA, which evaluates the accuracy and truthfulness of a model’s responses. However, we quickly realized that the 40-billion-parameter model was too large to train and run inference on without substantial computational resources. We compared Falcon-40B’s and Falcon-7B’s performance, found them comparable for our purposes, and worked with Falcon-7B from there.
We started by building a comprehensive pipeline, beginning with the installation of necessary libraries like Transformers, Accelerate, and LangChain. We then loaded and prepared the LLM (Falcon-7b-instruct) for the task, ensuring it was optimized for our specific requirements. The implementation of ChromaDB involved creating a database of embeddings, crucial for enhancing the LLM’s retrieval capabilities. This setup was instrumental in facilitating efficient handling and processing of large volumes of text data.
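A minimal sketch of the model-loading step is shown below, using the 2023-era LangChain wrapper around a Transformers text-generation pipeline; the generation settings are illustrative defaults, not our exact configuration.

```python
import torch
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_ID = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Wrap the Hugging Face generation pipeline so LangChain can drive it.
generate = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
llm = HuggingFacePipeline(pipeline=generate)
```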
Given the large-scale nature of data in LLM training, we leveraged ChromaDB’s ability to handle and process embeddings at scale. It ensured that the LLM could access the necessary semantic information without facing bottlenecks in data retrieval or processing. We also utilized RAGS (Retrieval-Augmented Generation for Sequences) as a pivotal tool in augmenting the language generation process. This approach enhanced the model’s ability to pull in relevant context or data as needed during generation: it essentially allows the LLM to ‘look up’ information before generating a response, leading to more informed and accurate outputs. The key technical advantage of RAGS lies in its ability to dynamically use retrieved information to augment the generation process. This is particularly beneficial in scenarios where the model needs to generate responses based on up-to-date or extensive external data that is not inherently part of its pre-trained knowledge.
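The sketch below shows how the retrieval-augmented setup might be wired together, reusing the llm wrapper from the previous sketch; the input file, chunk sizes, embedding model, and number of retrieved chunks are assumptions for illustration.

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Split the cleaned textbook text into overlapping chunks for embedding.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
with open("history_textbook.txt", encoding="utf-8") as f:  # placeholder input file
    chunks = splitter.split_text(f.read())

# Build a Chroma database of embeddings the LLM can retrieve from.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectordb = Chroma.from_texts(chunks, embedding=embeddings, persist_directory="chroma_db")

# Retrieval-augmented QA: the retriever "looks up" the most relevant chunks
# before the Falcon model generates its answer.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,  # the HuggingFacePipeline wrapper from the previous sketch
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)
print(qa_chain.run("Summarize the causes of the American Revolution."))
```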
In summary, by using RAGS and ChromaDB, combined with task-specific fine-tuning and prompt engineering, we took a comprehensive, technically grounded approach to developing a capable and efficient large language model.
Frontend
Our front-end was developed using React, a free and open-source JavaScript library built around the composition of components.
Our UI is showcased below!
The user journey begins on the home page. Here they are able to read a short description of what our app offers. On the top right, they are able to navigate back to the Home page or to our Upload page to upload their personal materials that they would like to use to create a personalized tutoring guide.
Navigating to the Upload page…
First, the user should select a subject of interest based on the options available in the drop down menu. This will select a corresponding model to use in the backend that is better trained for the subject. Once the user has selected their subject of interest, they are able to upload a file for which they would like to have study materials generated. The user can add some short text to describe the context of the content they are interested in studying, and then finally click the upload button.
This will send the request to our backend to begin generating study materials!
Taking a look at an example of some results…
In this example, the user has selected History as their subject of interest. This connects to our Falcon model on the backend, as shown near the green circle on the right. Next, the user has uploaded a history.json file for which they would like study materials generated. After entering “American History” as user context, our app returns a brief summary of the subject, Q&A that can be used for studying, as well as key terms from the subject. Now the user is ready to review and study!
Overall, we aimed to create a clean and simple user interface. Studying can be very scary and overwhelming, and we hope that our app allows for a more friendly and calming approach with its simple blue and grey tones.
Deployment
We deployed our front-end and api-service via Ansible under ac2152023_Naturally_Learning_and_Progress/src/deployment. The Ansible deployment covers our frontend-react website and our backend api-service, and includes our nginx, GCP VM, and Docker image configurations so that our instances and applications run smoothly upon deployment.
We have also added some CI/CD workflows thanks to GitHub Actions. Under `.github/workflows/build_containers.yml`, we have included the specification for a job that builds some of our containers and pushes them automatically to Docker Hub in order to keep track of the different versions of each container. When a push event occurs, GitHub Actions automatically checks out the latest version of our code to make it available for the subsequent workflow steps. We then set up the Docker environment in our Ubuntu image and log into the Docker Hub account using our credentials (username and private token), which are saved as secret variables in the GitHub repository. Finally, after building the frontend and backend images, we push them to the Docker Hub account to keep track of the new changes.
Challenges
While we tried our best to develop and deploy an accurate and strong-performing model, we did run into challenges.
Hallucinations →
A well-known problem with LLMs is hallucination. We observed that our model hallucinated, likely due to limited training data and insufficient computational resources for fine-tuning. Limited training data can lead to inadequate coverage of certain topics, resulting in the model’s inability to understand and generate accurate responses in those areas. This limitation is particularly noticeable in niche or highly specialized subjects where available data is scarce or not diverse enough; the model may then rely on overgeneralizations or incorrect patterns it has learned from the limited dataset. The models we used were also rather small, with limited token limits, which resulted in less-than-ideal outputs on tasks beyond the base models’ strengths. During our QLoRA fine-tuning, our “yuw385/falcon-7b-xxx” models all hallucinated quite frequently and performed much worse than the base model. We attempted to train our model specifically on the Q&A, terms, and summary tasks, but were still not able to improve its performance over the base model.
Training Data →
Another well-known adage in machine learning is “Garbage in, garbage out” :) While not all textbooks are free and publicly accessible, we attempted to find reasonable materials to train our models on. We made efforts to pre-process and clean our training data; however, some subjects, particularly STEM subjects, include many equations, figures, and diagrams in their textbooks, and given our limited time and computational tools it was not fully feasible to develop models capable of handling all subjects. We initially focused on History as our main subject, since it generally includes a large amount of text and we hoped it would be more interpretable to an LLM for training. We still offer other models, but their performance is largely questionable.
Computational Resources →
Training our LLM involved fine-tuning numerous hyperparameters, ensuring efficient parallel processing, and managing memory usage to prevent bottlenecks. This complexity demands a high level of expertise in parallel computing, distributed systems, and machine learning; for instance, it involved using distributed computing techniques to spread the workload across multiple GPUs, which requires sophisticated orchestration and synchronization. Additionally, the long duration of training not only tied up these resources for extended periods but also increased the risk of interruptions or failures, which can result in significant setbacks. LLMs are infamous memory hogs, and ours was no different. We reduced training and inference costs by quantizing our models to 4-bit precision on most layers.
Next Steps
To enhance the performance of our large language model (LLM) in the face of limited training data and computational resources, a strategic, multi-faceted approach is essential. Firstly, we aim to focus on subject-specific fine-tuning, which involves concentrating efforts on areas where the model currently underperforms. This targeted fine-tuning is not only more efficient in terms of resource usage but can also lead to significant improvements in those weaker areas. In scenarios where computational resources are constrained, we would adopt an iterative fine-tuning process. This method entails gradually fine-tuning the model on a smaller scale, progressively integrating more data as resources become available, allowing for continuous improvement over time.
In addition, a robust validation and testing protocol is crucial for maintaining the model’s accuracy and reliability. We aim to implement a comprehensive testing regime that includes cross-validation against reliable sources and a diverse array of test cases to ensure the model’s outputs are consistently accurate. We also plan to establish a feedback loop to identify and analyze any incorrect outputs or hallucinations. This feedback can then be used to inform subsequent training iterations, turning errors into valuable learning opportunities for the model. This continuous learning process helps refine the model’s responses and reduce inaccuracies.
Furthermore, to supplement our existing resources, we will consider exploring external collaborations and technological solutions. Partnerships with academic or research institutions can provide access to larger datasets and additional computational resources, enriching the training process. Utilizing premium cloud computing platforms offers another viable solution, as these platforms can provide scalable resources tailored for intensive tasks like model training and fine-tuning. Finally, we plan to engage in continuous monitoring and regular updates post-deployment. Regularly assessing the model’s performance helps identify any new issues or areas for enhancement.
References
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314.