GSoC '23: Exploring the Intersection of LLMs and Civic Technologies
For the past three months I have been working with the Mayor's Office of New Urban Mechanics (MONUM) in Boston on a web application that uses retrieval-augmented generative AI for text retrieval and Q&A. The project is part of Google's Summer of Code program, where contributors work with organizations on open-source projects. By the end of the program, I was able to build a prototype capable of answering general public inquiries about government-related questions.
Motivation
There has been huge excitement around large language models recently, and as a pioneer of civic technology, Boston's MONUM wants to explore potential uses of LLMs in the public sector. The flexibility and scalability of LLMs make them a natural candidate for text retrieval and Q&A for government workers, especially when large and sometimes siloed public organizations struggle to share information with one another.
Due to the nature of work in the public sector, the responses generated by the model must be credible and their sources must be traceable. Part of the problem can be solved by customizing the knowledge base of the LLM. However, even with a custom knowledge base, language models can still generate inaccurate or misleading responses. This is the main challenge of the project: building a retrieval-augmented generative AI app that guarantees a high level of accuracy and credibility.
Final deliverables
The code for this project is posted on this GitHub repo. The current working branches are “main” and “azure-free”. The “client” and “server” folders contain the code for the respective parts of the project. The current prototype uses a fine-tuned LLM with a customized knowledge base of multi-format, government-related files.
The client side is a React app that lets the user upload files, tag them with relevant themes and the organization they belong to, and add a short description. All of the original text in each file, along with all of this metadata, is used by Azure Cognitive Search when determining which files are relevant.
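As a rough illustration, the metadata attached to an upload might look something like the following; the field names here are hypothetical, not the app's exact schema:

```python
# Hypothetical shape of the metadata the React client sends with each upload;
# field names are illustrative, not the project's actual schema.
upload_metadata = {
    "title": "Resident parking permit renewal guide",
    "themes": ["transportation", "permits"],
    "organization": "Boston Transportation Department",
    "description": "Steps and documents required to renew a resident parking permit.",
}
```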
The server side is a Flask app that provides multiple APIs for file upload, retrieval, and generating LLM responses. For example, the “query” API takes the question that the user submitted on the frontend as a query parameter; it first fetches the top 3–5 relevant files from Azure Cognitive Search in vector form (more on this later), then it queries the language model to generate a response based on the retrieved files and their metadata.
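To make the flow concrete, here is a minimal sketch of what such a query endpoint could look like, assuming the azure-search-documents and (pre-1.0) openai Python SDKs. The index fields, prompt, and plain keyword search (standing in for the app's vector retrieval) are simplifications, not the project's actual code.

```python
from flask import Flask, request, jsonify
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
import openai

app = Flask(__name__)
openai.api_key = "<openai-api-key>"
search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="documents",
    credential=AzureKeyCredential("<search-api-key>"),
)

@app.route("/query")
def query():
    question = request.args.get("q", "")

    # Fetch the top few relevant files from Azure Cognitive Search.
    results = list(search_client.search(search_text=question, top=5))

    # Build a prompt that includes the retrieved text and metadata as context.
    context = "\n\n".join(
        f"{doc['title']} ({doc['organization']}): {doc['content']}" for doc in results
    )
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return jsonify({"answer": completion.choices[0].message["content"]})
```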
Each uploaded file is stored in two places. First, the file itself is stored in Azure Blob Storage, which enables file download by generating a cloud storage URL specific to each file. Second, the file's content is converted to vector form and stored in Azure Cognitive Search, which lets Cognitive Search quickly compute vector distances and determine which files are semantically closest to the user's query.
A separate repo contains a personal side project — a Chrome extension called “Shepherd” for augmented text retrieval. The app scrapes the content of the current web page using Cheerio and converts it to vector form, which is then used as the data for the LLM during text retrieval and Q&A. The user can connect their OpenAI account by entering their OpenAI API key; this information persists through the browser session and is auto-filled when the extension is opened, to reduce repeated logins. In addition, the user can customize Shepherd with any GPT-3.5 model: they can change the model or reset their API key at any time from the settings page. The user can also hide the chat history; with one click they can choose to see only the last Q&A. This feature was developed to give a clean and less cluttered user interface.
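A rough sketch of that two-step storage path, assuming the azure-storage-blob and azure-search-documents SDKs plus OpenAI embeddings; the container, index, and field names are placeholders rather than the project's real configuration:

```python
import openai
from azure.storage.blob import BlobServiceClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="documents",
    credential=AzureKeyCredential("<search-api-key>"),
)

def store_file(file_name: str, text: str, metadata: dict) -> None:
    # 1) Keep the original file in Blob Storage so a download URL can be generated later.
    blob = blob_service.get_blob_client(container="uploads", blob=file_name)
    blob.upload_blob(text, overwrite=True)

    # 2) Embed the text and push it (with its metadata) into the search index,
    #    where vector distance to the user's query can be computed.
    embedding = openai.Embedding.create(
        model="text-embedding-ada-002", input=text
    )["data"][0]["embedding"]
    search_client.upload_documents(documents=[{
        "id": file_name,
        "content": text,
        "content_vector": embedding,
        **metadata,
    }])
```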
Demo
Important takeaways
I learned a lot about the inner workings of large language models and the technical tools that can help fine-tune them for specific purposes. Here are some important takeaways I want to share:
Metadata is crucial in this project. With metadata, files are not only retrieved more accurately, but the language model also generates more accurate responses. Metadata also includes important and otherwise unavailable information such as the relevance score for retrieved files: the response coming back from Azure Cognitive Search contains a floating-point number on a scale of 0 to 1.0 that represents how relevant each file is to the user's query. In addition, the LLM gives a numeric value for how much each source file was used when generating a response.
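For example, assuming the search_client from the earlier sketch, each hit returned by Azure Cognitive Search exposes its relevance score under the "@search.score" key, which can be surfaced next to each source file:

```python
# Print each retrieved file's title (a placeholder field name) and its relevance score.
results = search_client.search(
    search_text="How do I renew a resident parking permit?", top=3
)
for doc in results:
    print(doc["title"], round(doc["@search.score"], 3))
```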
It was surprising to see that such numeric values exist to help reflect the accuracy of language models. They exist thanks to the use of vectors: language models use vectors to store and comprehend information semantically, and the closer the distance between two vectors, the closer their semantic meanings. There are multiple ways to measure vector distance, including the well-known Euclidean distance. In this project, the default method, cosine distance, is used; it emphasizes the angle between two vectors regardless of their magnitudes.
Using LlamaIndex and LangChain, the app processes each uploaded file into “chunks” stored as Document objects, each carrying an embedding vector that represents its semantic meaning. In this way, the app can communicate with the language model in terms of vectors.
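For intuition, cosine similarity (the complement of cosine distance) can be computed in a few lines; the vectors below are toy examples, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity: 1.0 means same direction, regardless of magnitude.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the length
c = np.array([-3.0, 0.5, 1.0])  # different direction

print(cosine_similarity(a, b))  # 1.0 -> treated as semantically identical despite size
print(cosine_similarity(a, c))  # ~0.08 -> semantically distant
```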
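A minimal sketch of that pipeline, assuming a 2023-era LlamaIndex API (module paths have changed in later releases) and an OpenAI key in the environment; the directory name and question are placeholders:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load the uploaded file(s) and let LlamaIndex split them into chunks,
# embedding each chunk so it can be compared to queries by vector distance.
documents = SimpleDirectoryReader("uploads").load_data()
index = VectorStoreIndex.from_documents(documents)

# The query engine retrieves the closest chunks and asks the LLM to answer from them.
response = index.as_query_engine().query("How do I renew a resident parking permit?")
print(response)
```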
Skills used:
React.js, Flask, JavaScript, Python
Skills learned:
LangChain, LlamaIndex, Pinecone vector store, Airtable APIs, Azure OpenAI, Cognitive Search, Blob Storage
Future plans
I hope to continue working with MONUM on scaling the application through the use of virtual machines and containers. Now, with access to the Boston government's Azure subscription, I am looking to deploy the application in the near future. This is exciting because the application could become part of the Boston government's infrastructure, meaning the public sector is beginning to enjoy the benefits of large language models.