LLM for the everyday Public Officer

Using RAG with Graph API

Ng Shangru
DSAID GovTech
8 min read · Nov 23, 2023


Great, so we have an on-premise LLM that is usable for government data.

But what about MY data? I could also benefit from having my own meeting scheduling person / document summariser / imaginary assistant*.

Conceptually, we need three ingredients to build such an assistant:

  1. First, we need to know what data we have.
  2. Then, we need to know how to get the data we have to the LLM.
  3. And lastly, we need to know how to teach the LLM to use our data.

*This is actually not too far off from what Microsoft 365 Copilot does, as shared at the recent Microsoft Ignite event. See more details about the differences between this setup and Copilot at the end of the article.

Knowing what data we have

First part’s pretty simple — most of our working data is already stored in the Microsoft M365 ecosystem (SharePoint, OneDrive, Outlook, etc.), so we just have to find a way to get data programmatically from M365.

But what if my data is outside of M365 like in Google Drive?

Well, it shouldn’t be. And even if it is, most other systems expose similar APIs, so the approach you’ll see with Graph API below can (mostly) be applied to them too.

Getting the data to the LLM

Up next is to figure out how to get the data to the LLM. Fortunately, Microsoft provides the Graph API that allows us to access these sources fairly easily. Note that Graph API is the name of the product and not directly related to the concept of knowledge graphs or graph data structures in programming.

Source: https://learn.microsoft.com/en-us/graph/images/microsoft-graph.png

Graph API also supports two different types of permissions: delegated and application.

Source: https://learn.microsoft.com/en-us/graph/images/auth-v2/app-privileges-illustration.png

Delegated permissions allow Graph API to retrieve only data that you, as a user, already have access to, which helps to mitigate the concern that everybody’s data will be exposed. It is also aligned with what we want in a personal LLM. The delegated permissions we want, for a start, are:

- Sites.Read.All (For SharePoint sites)
- Mail.Read.All (For Outlook)
- Files.Read.All (For OneDrive)

To call Graph API, an Azure AD (AAD) application first needs to be registered. Users can then authenticate via AAD to use the API.
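As a rough illustration of how the delegated sign-in could be wired up, here is a minimal sketch in Python using the MSAL library with a device code flow. The client ID, tenant ID and the choice of device code flow are placeholder assumptions, not the demo’s actual configuration.

import msal

# Placeholder values; use your own AAD app registration details.
CLIENT_ID = "<your-aad-app-client-id>"
TENANT_ID = "<your-tenant-id>"
# Delegated scopes following the permissions listed above; adjust to the exact
# permission names configured on your app registration.
SCOPES = ["Sites.Read.All", "Mail.Read.All", "Files.Read.All"]

app = msal.PublicClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
)

# Device code flow: the user signs in with their own account, so Graph API
# will only ever return data that this user already has access to.
flow = app.initiate_device_flow(scopes=SCOPES)
print(flow["message"])  # tells the user where to enter the device code
result = app.acquire_token_by_device_flow(flow)

access_token = result["access_token"]  # attached as a Bearer token in Graph API calls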

For this setup we opted to retrieve data only from a specific folder (e.g. FEEDME) rather than from all folders, to prevent accidental data leakage. As Graph API does not allow this filtering within the API call, it has to be done on our application side.

Image by the author

For emails, the setup is currently configured to only read the latest 10 emails in your Inbox folder.
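To make the retrieval concrete, here is a rough sketch (not the actual demo code) of the Graph API calls involved, assuming the access_token from the earlier authentication sketch and the requests library:

import requests

GRAPH = "https://graph.microsoft.com/v1.0"
headers = {"Authorization": f"Bearer {access_token}"}

# List the items inside the special FEEDME folder in the user's OneDrive.
# Graph API does not filter by file type in this call, so we keep only
# Word, CSV and PDF files on our application side.
items = requests.get(f"{GRAPH}/me/drive/root:/FEEDME:/children", headers=headers).json().get("value", [])
docs = [i for i in items if i["name"].lower().endswith((".docx", ".csv", ".pdf"))]

for doc in docs:
    # Download the raw file content for chunking and embedding later on.
    content = requests.get(f"{GRAPH}/me/drive/items/{doc['id']}/content", headers=headers).content

# The 10 latest emails in the Inbox folder, newest first.
emails = requests.get(
    f"{GRAPH}/me/mailFolders/Inbox/messages",
    headers=headers,
    params={"$top": 10, "$orderby": "receivedDateTime desc", "$select": "subject,from,bodyPreview"},
).json().get("value", [])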

Teaching the LLM to use our data

Now that we have the data and the LLM, how can we make the LLM use our data?

Fine Tuning (FT) and Retrieval Augmented Generation (RAG) are two primary methods for customizing a model. FT is akin to deeply studying a subject like Medicine over an extended period, building up knowledge gradually. RAG, in contrast, is similar to having an open book exam, where you have quick access to a broad range of information and the ability to cite specific references, but you need to sift through the information yourself.

While an open book exam offers immediate access to information and the added benefit of citing sources, it may not be as efficient or accurate as the deep, experiential knowledge a veteran doctor has from years of practice. However, if the subject changes to something like Law, someone who understands the basics, such as English, can use RAG effectively. They can quickly adapt by accessing and referencing relevant legal information, unlike a medical professional who is specialised in a different field.

In the context of customising a model, RAG proves more advantageous. Instead of repetitively teaching a model (like explaining John and Jane’s traits 100,000 times), RAG is like giving the model a comprehensive book on John and Jane, complete with references for easy access and citation. This method is more efficient and adaptable, making RAG the preferred choice when reference citation is important.

So what is a typical RAG setup?

RAG enhances LLM output with just-in-time additional context. The flow is summarised nicely in this diagram.

Source: https://docs.aws.amazon.com/images/sagemaker/latest/dg/images/jumpstart/jumpstart-fm-rag.jpg

Components powering RAG typically include a vector store and an embedding model.

Data is split into chunks and turned into embeddings before being stored as vectors. A prompt or question from the user goes through the same process and is compared (typically via cosine similarity) against the vectors in the store to find the most relevant pieces of information.

Source: https://developers.google.com/static/machine-learning/guides/text-classification/images/EmbeddingLayer.png

The embedding model we use is thenlper/gte-large, which is also hosted locally. The original chunked text is stored as part of the vector metadata, while the vector itself is just a bunch of numbers. To enhance data security further, the chunked text (original_text in this case) can be further encrypted prior to storage.

We do not need to encrypt the vectors themselves as it is (for now) extremely difficult to reverse engineer vectors into their source data format. However, this does not mean that the vector database is not protected — it still has the traditional database security controls in place. You can read more about Gartner’s recommendation on this topic here.
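As a rough illustration of that optional encryption step, here is a minimal sketch using Python’s cryptography library. The key handling is deliberately simplified; a real deployment would keep the key in a proper secrets store.

from cryptography.fernet import Fernet

# In practice the key would come from a secrets manager, not be generated inline.
key = Fernet.generate_key()
fernet = Fernet(key)

chunk_text = "In the heart of a bustling city, where the neon lights danced..."

# Encrypt the chunked text before writing it into the vector metadata,
# and decrypt it only when it is retrieved to build the LLM prompt.
encrypted = fernet.encrypt(chunk_text.encode("utf-8"))
original_text = fernet.decrypt(encrypted).decode("utf-8")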

Example vector with metadata:

{
  "id": "6a9a08d1-9f3a-4b08-832f-072dfc3bcdb6",
  "vector": [1.5, 2.0, ..., 5.1, 6.3, ..., 9.4, 10.0],
  "metadata": {
    "name": "example_vector",
    "category": "feature",
    "timestamp": 1678901234,
    "original_text": "In the heart of a bustling city, where the neon lights danced with the rhythm of life, Olivia found herself drawn to the enigmatic glow of a small antique shop tucked away in a narrow alley. Intrigued, she stepped inside, greeted by the scent of old books and the soft hum of a vinyl record playing in the background."
  }
}
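As a minimal sketch of how such an entry could be produced and queried, assuming the sentence-transformers library for thenlper/gte-large and a plain in-memory list standing in for the vector store (the actual setup uses a proper vector database and smarter chunking):

import time
import uuid

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large")  # hosted locally

def chunk(text, size=500):
    # Naive fixed-size chunking; see "proper chunking techniques" below.
    return [text[i:i + size] for i in range(0, len(text), size)]

document_text = "In the heart of a bustling city, where the neon lights danced with the rhythm of life, Olivia found herself drawn to the enigmatic glow of a small antique shop tucked away in a narrow alley."

store = []
for piece in chunk(document_text):
    store.append({
        "id": str(uuid.uuid4()),
        # normalize_embeddings=True lets cosine similarity reduce to a dot product
        "vector": model.encode(piece, normalize_embeddings=True),
        "metadata": {"timestamp": int(time.time()), "original_text": piece},
    })

# A question goes through the same embedding process and is ranked by cosine similarity.
question = "What was Olivia drawn to?"
query_vec = model.encode(question, normalize_embeddings=True)
ranked = sorted(store, key=lambda e: float(np.dot(e["vector"], query_vec)), reverse=True)
top_chunks = [e["metadata"]["original_text"] for e in ranked[:3]]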

Other areas to consider for RAG

There are many other aspects to a RAG setup to consider beyond just its components, such as:

  • Extending context length vs. vector stores
  • Extending RAG with knowledge graphs
  • Proper chunking techniques to further preserve semantic information
  • And many, many more

But here, we’ll focus on feasibility first before tuning for performance.

Putting it all together

After everything discussed above, what does the setup look like? There are two main flows — first is the data ingestion flow, then the regular Q&A flow:

Data Ingestion Flow

Image by the author

RAG Q&A Flow

Image by the author

The above diagrams are simplified, so not every component is shown (e.g. your firewalls, proxies, NATs, what have you). While the LLM server is on-premise, for testing purposes we have connected it to GCC via a VPN relay so that test users outside of the VPN network can still access the LLM.

The setup will read:

  • All Word, CSV and PDF files in your OneDrive’s FEEDME folder
  • 10 latest emails in Inbox (Only works if the user is on Outlook Web and not Exchange Server)

Even though the ingestion can be automated, we opted to make things a little more manual so that every data ingestion action is intentional, to mitigate any chance of accidentally ingesting wrong or sensitive data.
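As a rough sketch of the Q&A flow itself, assuming the top_chunks retrieval from the earlier sketch and an on-premise LLM exposed behind an OpenAI-compatible chat endpoint (the URL and model name below are hypothetical placeholders):

import requests

def answer(question, top_chunks):
    # Stitch the retrieved chunks into the prompt so the LLM answers from our data.
    context = "\n\n".join(top_chunks)
    prompt = (
        "Answer the question using only the context below, and cite the snippet you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    resp = requests.post(
        "http://llm.internal.example/v1/chat/completions",  # hypothetical on-prem endpoint
        json={
            "model": "local-llm",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        },
    )
    return resp.json()["choices"][0]["message"]["content"]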

Demo Setup

To run the demo, one would first need to put files within a special “FEEDME” folder in their OneDrive.

After which, they can head to the App to Login (WEB) and log in via AAD to enable Graph API access. After logging in, a Graph API token is generated that will be used by the app to retrieve the files from OneDrive and/or Outlook.

After clicking “Get my files”, the data ingestion flow above is triggered on the OneDrive folder.

Similarly, the “Get my emails” button triggers the data ingestion on the latest 10 emails in the user’s Inbox.

The user can then do Q&A on their own data.

While the demo setup is fairly limited, the small group of test users we’ve had so far feel that it is useful for their daily work and are excited to see more developments in this area.

Final Thoughts

Building upon our previous work to host a local LLM in the government network, we can now apply it practically in day-to-day use. Of course, it is nowhere near complete, but it demonstrates that the data flow is achievable and can be made easily accessible to all public officers without the need for complex onboarding or training.

We can’t yet achieve a fully local, laptop-only setup since policy prevents us from installing the requisite frameworks to self-host, and hardware limitations would greatly impact performance. But I believe things will change, and it will only be a matter of time. Or not?

But wait, isn’t there already Microsoft Copilot?

The purpose of this work is to gain a comprehensive understanding of the underlying mechanisms and functionalities of these services. This knowledge will empower us to develop similar solutions in-house should the need arise. Our focus is not on replicating services that can be readily acquired from industry at a larger scale and with greater efficiency.

Copilot uses Azure OpenAI for its LLM services (i.e. data goes to the internet / GCC), so it may not be applicable for certain use cases. This RAG method works even if we host everything on-premise, in line with the previous article’s setup for more highly classified data.

But wait, isn’t this just semantic search?

Yes, good semantic search is one of the foundations of good RAG performance. RAG incorporates the benefits of semantic search and takes it a step further by using the LLM to comprehend the retrieved results and synthesise a better answer.

Can I get access to the demo?

Not at the moment since it’s still being tested internally.

Can it really schedule my meetings for me?

Source: Dilbert

Not yet, but (hopefully) soon.

Thanks for reading!

Stay tuned for our future series on how to improve RAG performance!

Source: Stable Diffusion Web
