Custom-Tailored AI: Leveraging Large Language Models on Your Own Data

Lysander
Published in Auraidata
Sep 25, 2023 · 11 min read
Courtesy of Stephen Hardy

TLDR: In this exploration of language models, we’ve traced their evolution from N-grams to transformative Transformers. We’ve witnessed their metamorphosis into reasoning chatbots, a leap forward in human-computer interaction. While promising, these models face limitations, including data drift and hallucinations. To enhance reliability, we explored Retrieval-Augmented Generation (RAG) and query decomposition. Applied in practice to learning RPGs, these techniques show promise, but attention to document structure is vital for success. In conclusion, language models stand at the forefront of AI, but we still need to develop new techniques to bring them in line with our intentions.

Language is a complex and ever-evolving means of human communication, and it serves as a common target modality in artificial intelligence (AI) research. In this article, we will delve into the fascinating world of Large Language Models (LLMs), one of the most prevalent tools for capturing and understanding language. We’ll explore their evolution, their current limitations, and potential solutions to those limitations, and finally, we’ll illustrate their use with the case of a chatbot capable of retrieving private information.

The Need to Model Language

Language modeling is the practice of creating a statistical model capable of predicting the probability of sequences of words. This task is pivotal for a wide range of applications, from chatbots that converse with us naturally to translation services that bridge language barriers. Language models are the backbone of these applications, enabling them to predict what comes next in a sequence of words, sentences, or even entire paragraphs.

Courtesy of Stephen Wolfram

Evolution of Language Models

The overall trend in language modeling has been the ability to take an ever-growing amount of context into account. Here, context is the sequence of words on which the prediction of the next word is based.

  • One of the earliest and simplest models to grasp is the N-gram model (see the sketch after this list). These models calculate the probability of a word based on the preceding N words. While they serve as classical conditional probability models, they have limitations. N-grams struggle with capturing long-range dependencies, face data sparsity issues as N increases, and lack semantic understanding, as they don’t model word relationships.
  • Recurrent Neural Networks (RNNs) represented a significant advancement over N-grams. They introduced the concept of maintaining a hidden state that evolved with each word in a sequence. This innovation enabled RNNs to consider more extensive context, leading to the generation of more coherent text. However, RNNs had their challenges, including difficulties in handling very long sequences and the notorious vanishing gradient problem. Moreover, their sequential nature limited parallelization, impacting throughput.
  • Long Short-Term Memory (LSTM) Networks were introduced to address some of the limitations of traditional RNNs. LSTMs incorporated a gating mechanism that allowed them to determine when to forget or remember information, enhancing their capability to handle long-range dependencies and mitigating the vanishing gradient problem. Consequently, LSTMs gained popularity in various natural language processing tasks. Nevertheless, their recurrent nature still posed limitations in terms of capacity and throughput.
  • Transformer Models: The true breakthrough in language modeling came with the advent of transformer models, exemplified by models like GPT (Generative Pre-trained Transformer). This architectural innovation revolutionized the field. Transformers harnessed attention mechanisms, dynamically weighing the importance of each word in a sentence. This dynamic approach empowered them to effectively manage extensive contexts and generate highly coherent text.
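To make the N-gram idea concrete, here is a minimal bigram (N = 2) model in Python. The toy corpus and function names are purely illustrative, not part of any real system: the model simply counts which word follows which and turns those counts into probabilities.

```python
from collections import Counter, defaultdict

# Toy corpus; a real N-gram model is estimated from millions of sentences.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows a given word (N = 2, i.e. a bigram model).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probabilities(prev):
    """P(next | prev), estimated from raw bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probabilities("the"))       # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(bigram_counts["sat"].most_common(1))  # [('on', 2)]
```

Because the model only ever looks at the previous N − 1 words, anything said earlier in the text is invisible to it, which is exactly the long-range dependency problem the later architectures address.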

Transformers embraced the concept of pre-training on vast text corpora and fine-tuning for specific tasks. This dual-stage process propelled Large Language Models (LLMs) into the spotlight. Pre-training on large datasets endowed these models with broad linguistic knowledge, while fine-tuning customized them for specialized applications such as chatbots, translation, and content generation.

These LLMs, often comprising billions of parameters, possess the remarkable capability to predict the next word not merely based on a few preceding words but on the entire context of a paragraph, or even multiple paragraphs. However, this capability comes at the cost of significantly increased computational resources, data, and training time. Even with these investments, most LLMs based on the transformer architecture can only handle around 4000 tokens (~3200 words) at a time, far from the length of a full book.
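That token budget is easy to underestimate. As a rough check, the snippet below uses the tiktoken library to count how many tokens a piece of text consumes; the file name is a hypothetical placeholder, and any tokenizer matched to your model of choice works equally well.

```python
import tiktoken

# Tokenizer matching OpenAI's gpt-3.5-turbo; other models use other encodings.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = open("rulebook_chapter.txt").read()  # hypothetical file
n_tokens = len(encoding.encode(text))
print(f"{n_tokens} tokens")

# With a ~4,000-token context window, anything longer has to be split
# into chunks before the model can see it.
```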

Courtesy of Matheus Bertelli

Chat models

While LLMs that predict the next word given the context are widely usable in applications such as translation, to a layperson the transformation of these LLMs into chat models is what makes them truly worthwhile. Transforming LLMs into chat models means that, beyond mere language modeling, we use the billions of parameters to instill a sense of reasoning by fine-tuning these models to respond to questions and instructions. This fine-tuning process involves training a pre-existing LLM on a specific dataset with a defined purpose. For instance, a general-purpose LLM like GPT-3 can be fine-tuned to respond to questions or follow instructions accurately. Fine-tuning methods can be broadly categorized into two key approaches: reinforcement learning from human feedback and classical fine-tuning.

Reinforcement Learning from Human Feedback (RLHF)

In this approach, LLMs are fine-tuned using reinforcement learning, a family of algorithms that lets models learn from interactions with their environment. The key steps are:

  • Initial Pre-training: LLMs undergo extensive pre-training on a vast corpus of text from the internet. This initial phase equips them with a broad understanding of language, grammar, and context.
  • Fine-tuning Dataset: To make these models more specific and task-oriented, they are fine-tuned on a smaller dataset that is carefully generated. In RLHF, this dataset often arises from human-AI interactions.
  • Human Feedback Loop: This is where the reinforcement learning aspect comes into play. The model generates responses to prompts or queries, and these responses are evaluated by humans who provide feedback. The feedback can be in the form of reward scores, indicating how well the model’s response aligns with the desired outcome. The model then adjusts its parameters based on this feedback to improve its responses. It learns to generate responses that align more closely with human expectations, thus becoming more effective at its intended tasks.

Classical Fine-Tuning

In contrast, classical fine-tuning follows a more structured approach:

  • Initial Pre-training: Similar to RLHF, the model goes through extensive pre-training to grasp the fundamentals of language.
  • Task-Specific Data: Instead of relying on human feedback for fine-tuning, classical fine-tuning uses task-specific data. This dataset often comprises examples of correct behavior or responses for a particular task, such as prompts paired with the correct responses (a small example follows this list).
  • Mimicking Demonstrations: During fine-tuning, the model attempts to mimic the behavior demonstrated in the task-specific data. It learns to generate responses that are similar to the provided examples.
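As a concrete illustration, a demonstration dataset is often just a file of prompt/response pairs. The sketch below builds a couple of invented examples in the chat-style JSONL format that OpenAI’s gpt-3.5-turbo fine-tuning accepted at the time of writing; treat the exact format and contents as an assumption for illustration only.

```python
import json

# A handful of invented demonstrations: each line pairs an instruction with the desired reply.
examples = [
    {"messages": [
        {"role": "user", "content": "Summarize: The quick brown fox jumps over the lazy dog."},
        {"role": "assistant", "content": "A fox jumps over a dog."},
    ]},
    {"messages": [
        {"role": "user", "content": "Translate to French: Good morning."},
        {"role": "assistant", "content": "Bonjour."},
    ]},
]

with open("demonstrations.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# During fine-tuning, the model is trained to reproduce the assistant replies
# given the user prompts, i.e. to mimic the demonstrations.
```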

Each of these fine-tuning approaches has its distinct advantages and drawbacks. RLHF allows for dynamic and adaptive learning but necessitates ongoing human assessment. Conversely, fine-tuning on demonstrations tends to lock the model into the tone of its examples, potentially limiting the diversity of its responses.

These fine-tuning techniques enable LLMs to become powerful and flexible tools, capable of providing contextually relevant responses for a wide range of tasks and applications. However, they are not without their challenges and limitations, which we will explore further in the following sections.

Limitations of Chat/Instruct LLMs

Besides the considerable expense of training and running inference with these large language models, LLMs rely on the data they were trained on, which means they often lack access to the most recent information. This limitation can be a hurdle, especially for businesses and organizations that require up-to-the-minute data to make informed decisions.

LLMs, at their core, are sophisticated next-word predictors. While they excel in generating contextually relevant text, they are not infallible. They can sometimes “hallucinate” or generate content that sounds plausible but is entirely fictional. This poses a challenge when it comes to relying on them for accurate, fact-based information.

Retrieval-Augmented Generation (RAG)

To mitigate these inherent limitations, models need to be equipped with the ability to retrieve and search for information, either online or in private data stores, so that their answers do not drift away from current data as their training data ages.

To achieve this, we will show how to implement Retrieval-Augmented Generation (RAG) to ground LLMs in current and factual data, thereby reducing the dependency on the model’s internal knowledge base and its aptitude for “hallucinations”.

But before that, it is important to understand how language is represented in LLMs. As most will know, computers deal with zeros and ones, not with words, which leads to the concept of embeddings.

Embeddings are fixed-sized vector representations of elements, such as tokens (in the context of natural language processing). These vectors map tokens to unique positions in multi-dimensional space, encoding linguistic properties like word similarity and context. This enables models to understand the relationships between words and phrases. For instance, embeddings recognize that “king” is related to “queen” but not to “car.”

Furthermore, embeddings can be leveraged to relate paragraphs of text to one another based on vector similarity (L2, cosine, inner product). By comparing the embeddings of different pieces of text, we can determine how closely they are related.
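As a small illustration of vector similarity (assuming OpenAI’s pre-1.0 Python client and the Ada v2 embedding model used later in this article; the sentences are invented), the snippet below embeds three sentences and compares them with cosine similarity.

```python
import numpy as np
import openai  # pre-1.0 openai client assumed, OPENAI_API_KEY set in the environment

sentences = [
    "The king rules the castle.",
    "The queen rules the castle.",
    "My car needs new tires.",
]

# One API call returns an embedding vector per input sentence.
response = openai.Embedding.create(model="text-embedding-ada-002", input=sentences)
vectors = [np.array(item["embedding"]) for item in response["data"]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # high: the "king" and "queen" sentences are related
print(cosine(vectors[0], vectors[2]))  # noticeably lower: the "car" sentence is unrelated
```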

This idea of vector search lies at the heart of RAG:

  1. We break down the textual content that we want our LLMs to access into smaller chunks.
  2. Each chunk is transformed into an embedding.
  3. These embeddings are stored in a vector database (a minimal indexing sketch follows this list).
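The sketch below is one minimal way to implement these three steps, assuming chromadb as the vector store, the same Ada v2 embeddings as above, and a placeholder file name; a production pipeline would use a smarter chunking strategy and a persistent database.

```python
import chromadb
import openai  # pre-1.0 openai client assumed

client = chromadb.Client()  # in-memory store; use a persistent client in production
collection = client.create_collection("rulebook")

def embed(texts):
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [item["embedding"] for item in response["data"]]

# 1. Break the source text into smaller chunks (here: naive fixed-size splits).
text = open("rulebook.txt").read()  # hypothetical file
chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]

# 2 and 3. Embed each chunk and store it in the vector database.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embed(chunks),
)
```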

With our data structured in this manner, we can now efficiently interact with LLMs:

  1. When we receive a prompt or query for an LLM, we generate an embedding for it.
  2. We compare this embedding against the embeddings in the vector database, retrieving the closest K chunks.
  3. We provide the prompt along with the retrieved chunks to the LLM (see the query-time sketch after this list).
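Continuing the indexing sketch above (and reusing its `collection` and `embed`), query time then looks roughly like this; the prompt template is our own illustration, not a prescribed one.

```python
question = "How does initiative work in combat?"

# 1. Embed the query and 2. retrieve the K closest chunks from the vector store.
results = collection.query(query_embeddings=embed([question]), n_results=3)
context = "\n\n".join(results["documents"][0])

# 3. Hand the question plus the retrieved chunks to the chat model.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion["choices"][0]["message"]["content"])
```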

RAG dramatically reduces the occurrence of “hallucinations” because it equips the LLM with all the relevant facts it needs. The model is no longer left to generate information from scratch; instead, it synthesizes answers from the information provided, resulting in more reliable responses.

Query Decomposition

To enhance the retrieval process of RAG, one can additionally make use of query decomposition, leveraging the built-in knowledge of the LLM to intelligently decompose a question into multiple sub-questions. Each of these smaller sub-questions can then be addressed individually, allowing for a more comprehensive response.

Example of query decomposition. Courtesy of Llama Index.
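One straightforward way to implement the decomposition step is to simply prompt the chat model for sub-questions. The sketch below is an assumption about how such a prompt could look, not the exact prompt used in the system described later.

```python
import openai  # pre-1.0 openai client assumed

def decompose(question):
    """Ask the chat model to split a complex question into simpler sub-questions."""
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": (
                "Break the user's question into at most three simpler sub-questions, "
                "one per line. Return only the sub-questions."
            )},
            {"role": "user", "content": question},
        ],
    )
    text = completion["choices"][0]["message"]["content"]
    return [line.strip("-• ").strip() for line in text.splitlines() if line.strip()]

sub_questions = decompose("How do I create a dwarf fighter and what dice do I roll in combat?")
# Each sub-question is then answered individually via the RAG pipeline sketched above,
# and the partial answers are combined into a single response.
```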

A side effect is that we have the opportunity to address multiple sources simultaneously. This capability enables the model to gather information from diverse, trusted sources, significantly enriching the depth and accuracy of its responses.

While query decomposition empowers language models to provide more detailed and precise responses, it may come at the cost of response time. The chatbot or language model may take slightly longer to generate an answer as it references and combines information from multiple sources. However, this trade-off often results in responses that are highly reliable and contextually rich.

Use Case — Learning New RPGs

Role-Playing Games (RPGs) have captured the imaginations of countless enthusiasts, offering immersive storytelling and dynamic adventures. However, for newcomers to this hobby, navigating the rulebooks and game mechanics can be a daunting task.

In our quest to explore the potential of RAG and query decomposition, we decided to build a chatbot that enables users to query RPG rulebooks and so interactively learn new rules and mechanics.

Architecture of a RAG-augmented chatbot.

Our initial step involved sourcing several RPG rulebooks, collectively amassing over 4,000 pages of content. Each rulebook became a distinct vector database, with every book divided into smaller, manageable chunks. For each chunk, we generated an embedding using OpenAI’s Ada v2 embedding model. Additionally, we enriched each chunk with metadata, including the types of questions it could address and essential keywords. All this information was meticulously stored in a chromadb vector database, although the choice of vector database is flexible and adaptable.

Indexing process for RAG.

The next piece of the puzzle involved the creation of a user-friendly chatbot using Streamlit, powered by the OpenAI ChatGPT API. This setup allowed us to leverage the newly released function calling feature, enhancing the chatbot’s capabilities.

At the core of our system lies a vital function usable by ChatGPT, responsible for performing query decomposition and RAG. The query decomposition phase revolves around crafting a specialized prompt to instruct ChatGPT to break down complex questions into a series of simpler sub-questions. Subsequently, we extract these sub-questions from the response and feed each one into the RAG system.
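Wiring such a function into the chat loop with OpenAI’s function-calling feature could look roughly like the sketch below; the function name, schema, and stub body are our own illustrative choices rather than the actual implementation.

```python
import json
import openai  # pre-1.0 openai client assumed, OPENAI_API_KEY in the environment

def answer_rulebook_question(question: str) -> str:
    # Placeholder: in the real system this runs the query decomposition and
    # RAG steps sketched earlier and returns the combined answer.
    return f"(answer for: {question})"

# Describe the function so ChatGPT can decide when to call it.
functions = [{
    "name": "answer_rulebook_question",
    "description": "Decompose a rules question and answer it from the indexed RPG rulebooks.",
    "parameters": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "How does initiative work in combat?"}],
    functions=functions,
    function_call="auto",
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    args = json.loads(message["function_call"]["arguments"])
    print(answer_rulebook_question(args["question"]))
```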

The RAG system processes each sub-question, identifies the most suitable information source, and extracts the K most relevant chunks. For every sub-question, a response is formulated. These responses collectively address the main question, and the resulting answer is provided to the chatbot user.

Our experimentation demonstrated the promising potential of this technique. To validate its efficacy, we established a benchmark consisting of 390 questions, with our system achieving an accuracy rate of 71%. These questions encompassed diverse RPG topics, including character creation, dice mechanics, and combat.

However, the system’s performance was heavily influenced by the structure of the rulebook content. If the question and answer were located in close proximity within the source RPG rulebook, success rates were high. Yet when a rule spanned multiple pages and was only named on the first page, the retrieved chunks missed the subsequent pages, impacting accuracy. To address this, future iterations should consider not only individual pages but also the document’s structural organization as a whole.

Conclusion

In this journey through language models, we’ve seen their evolution from N-grams to powerful Transformers. These models, when transformed into chatbots, become tools for reasoning.

But they’re not without limitations. They lack access to real-time data and can sometimes generate inaccurate content.

To address this, we explored Retrieval-Augmented Generation (RAG) and query decomposition. RAG grounds language models in current data, reducing inaccuracies. Query decomposition enhances responses by breaking down complex queries into simpler ones.

We put this to the test with a chatbot for learning RPGs, achieving promising results but also recognizing the importance of document structure.

In the ever-expanding realm of language AI, these innovations hold the potential to revolutionize communication and understanding.

Aurai provides custom data solutions that help companies gain insights into their data. We engineer your company’s future through simplifying, organizing and automating data. Your time is maximized by receiving the automated knowledge effortlessly and enacting better processes on a foundation of relevant, reliable, and durable information. Interested in what Aurai can mean for your organisation? Don’t hesitate to contact us!
