Enhancing LLMs with Vector Databases

Aidan Pak
7 min read · Feb 10, 2024

Originally published on Substack on October 12, 2023.

With the rise of LLMs, we are entering an era of non-deterministic computing. Large Language Models, like ChatGPT and Claude by Anthropic, operate on the principles of probabilistic generation and have demonstrated an astonishing capacity to apply reasoning and logic on top of unstructured data. However, adoption of LLMs within the enterprise has been severely slowed by a significant set of hurdles. Primarily, model hallucinations, or outputs in which language models produce nonsensical or factually incorrect responses, have introduced instability and eroded trust in the generative capabilities of LLMs.

Medley of Problems w/ LLMs:

  1. LLMs are inherently static. They lack up-to-date information, and updating them through retraining is expensive and time-consuming.
  2. LLMs are trained for generalized tasks and lack domain-specific knowledge for the majority of enterprise use cases.
  3. Training and deploying LLMs is technically challenging and resource-intensive. Few organizations possess the financial and human resources to do so at scale.
  4. LLMs function as “black boxes”: it is not easy to understand which sources a model was drawing on when it arrived at its conclusions.

While still very early in the adoption cycle, companies today are exploring solutions that enable LLMs to work with real-time enterprise data while mitigating the downsides of probabilistic generation. Systems that enable these capabilities will be a significant catalyst for widespread adoption of Generative AI technologies.

Background on LLMs

In one sentence, autoregressive language models generate text by predicting the most likely response to a given input, word by word.

In a little more depth, LLMs are trained on terabytes of plaintext from the web to learn the nuances and structure of text. Then, when an LLM is prompted, the input text is split up and processed through the layers of the language model. Drawing on its knowledge from the training data, the model numerically scores the relationships between the words in the input and uses this information to generate a probability distribution for the most likely first word in the response. The language model then appends the most probable word to the output sequence, and the resulting phrase is iteratively fed back through the model until the desired output length is reached.
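To make that loop concrete, here is a toy Python sketch of the word-by-word generation process. The tiny vocabulary and the `toy_model` scoring function are invented stand-ins for a real model's learned parameters; only the overall shape of the loop (score, softmax, pick the most probable word, append, repeat) mirrors what an actual LLM does.

```python
# Toy sketch of autoregressive decoding (not a real LLM): at each step the
# "model" scores every word in a tiny vocabulary, the scores become a
# probability distribution via softmax, and the most likely word is appended
# to the output, which is fed back in as context for the next step.
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_model(context):
    """Stand-in for an LLM: returns one arbitrary score (logit) per vocabulary word."""
    random.seed(len(context))  # deterministic toy scores for illustration
    return [random.uniform(-1, 1) for _ in VOCAB]

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def generate(prompt, max_new_words=5):
    context = prompt.split()
    for _ in range(max_new_words):
        probs = softmax(toy_model(context))          # probability distribution over the next word
        next_word = VOCAB[probs.index(max(probs))]   # greedy pick: most probable word
        context.append(next_word)                    # append and feed the result back in
    return " ".join(context)

print(generate("the cat"))
```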

Compared to the computing architectures of the past, language models employ a fundamentally distinct approach to information retrieval. LLMs are not databases or search engines, and when prompted with a question requiring domain-specific knowledge, they do not refer to an external knowledge base. Instead, LLMs rely on ‘parametric knowledge,’ the information encoded in the numerical values of their parameters, which serves as the basis for the model’s statistical predictions of the next word. In this way, the information generated by LLMs is based purely on the probability of each word’s occurrence, and therefore, LLMs lack an intrinsic understanding of the text they are generating.

To gain a much deeper understanding of how LLMs work, reference my previous blog post linked here. But in short, the main conclusion is that LLMs generate text probabilistically, and the statistical nature of their language generation capabilities is the root cause of model hallucinations.

The Promise of Retrieval-augmented Generation (RAG)

Retrieval-augmented generation is a promising solution for companies seeking to enrich LLMs with proprietary data without incurring the expenses associated with fine-tuning or training from scratch. When implemented correctly, retrieval-augmented generation goes beyond conventional prompt engineering: it can ground an LLM with relevant context, diminish the likelihood of hallucinations, work around the models’ inherently static knowledge, and enhance traceability (addressing all four major flaws of LLMs).

Retrieval-augmented generation works by introducing an external information retrieval system to function alongside an LLM. Instead of directly prompting the language model and hoping its numerical representation of information will lead to an accurate response, the idea is to first query an external knowledge base to pull up relevant context. This information is then appended to the original prompt to form a comprehensive query, which is subsequently sent to the language model. The LLM then synthesizes all the information in the new prompt (essentially a question-and-answer pair) to formulate a coherent and context-aware response.
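As a rough sketch of that flow in Python, with `retrieve` and `call_llm` as toy stand-ins for a real retrieval system and a real LLM API, the pattern looks something like this:

```python
# Minimal sketch of the RAG flow described above. `retrieve` and `call_llm`
# are toy stand-ins for a real retrieval system and a real LLM API.
def retrieve(query: str, k: int = 3) -> list[str]:
    # Toy stand-in: a real system would search an external knowledge base here.
    knowledge_base = [
        "Resetting the router restores network connectivity.",
        "The service listens on port 8080 by default.",
    ]
    return knowledge_base[:k]

def call_llm(prompt: str) -> str:
    # Toy stand-in: a real system would call a language model API here.
    return f"<LLM response to a {len(prompt)}-character prompt>"

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve(question))
    # The retrieved context is appended to the original question to form the augmented prompt.
    augmented_prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(augmented_prompt)

print(answer_with_rag("How do I troubleshoot network connectivity issues?"))
```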

Lexical Vs. Semantic Search

The primary challenge in implementing retrieval-augmented generation is developing a retrieval mechanism capable of determining relevant context given a specific query. To overcome this hurdle, today’s implementations of RAG use vector databases, which are specialty databases designed to retrieve data based on semantics.

Unlike traditional databases, which rely on keyword-based search (lexical search), vector databases implement semantic search, a groundbreaking approach to information retrieval that distinguishes entries based on similarity. While lexical search determines the best results in a database by matching specific keywords between the query and entries, semantic search comprehends the underlying meaning of the query, identifying related concepts, synonyms, and even ambiguous terms within the search space.
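A small illustration of the difference, assuming made-up three-dimensional embeddings for a query and a document that share no keywords but mean similar things:

```python
# Lexical search only counts shared keywords, while semantic search compares
# embedding vectors. The 3-dimensional "embeddings" below are invented for
# illustration; a real embedding model would produce them.
import math

def lexical_score(query: str, doc: str) -> int:
    # Keyword overlap: how many query words literally appear in the document.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = "car won't start"
doc = "troubleshooting automobile ignition failures"

# No shared keywords, so lexical search scores this clearly relevant document as 0.
print(lexical_score(query, doc))  # -> 0

# Hypothetical embeddings place the two texts close together in vector space.
query_vec, doc_vec = [0.9, 0.1, 0.3], [0.85, 0.15, 0.35]
print(round(cosine_similarity(query_vec, doc_vec), 3))  # -> close to 1.0
```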

Semantic Search with Vectors and Embedding Models

Semantic search over a dataset can be implemented by applying an embedding model, a learned transformation that maps an unstructured data type to high-dimensional vectors. Like generative AI models, embedding models are neural networks trained on large corpora of unstructured data, and they learn to encode the features and semantic relationships of a dataset into multi-dimensional vectors.

For example, an embedding model can map books to vectors that capture the similarity between the titles based on genre, themes, etc. The book Alice’s Adventures in Wonderland can be represented by the two-dimensional vector (-11,-41), while Dracula can be represented by (-4,-50). The distance between the vectors in this space reflects the closeness of meaning or intensity of the relationship between the books. In this way, the task of semantically searching a database of books can be simply achieved by performing a similarity search over the collection of book embeddings.
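With the two example vectors above, the comparison reduces to simple geometry; for instance, the Euclidean distance between the two book embeddings:

```python
# Using the two-dimensional book vectors from the example above: the Euclidean
# distance between the embeddings is a simple measure of how related the books are.
import math

alice = (-11, -41)   # Alice's Adventures in Wonderland
dracula = (-4, -50)  # Dracula

distance = math.dist(alice, dracula)
print(round(distance, 2))  # ~11.4: a small distance in this toy space suggests related books
```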

Going beyond books, vectors can be used to represent other forms of unstructured data, including images, videos, and text documents. In practice, embedding models use vectors of thousands of dimensions to represent unstructured data. For instance, OpenAI’s text embedding model maps words and sentences to 1536-dimensional vectors. Using higher dimensionality allows the embedding model to better capture the subtle nuances and intensity of relationships between the data.
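For instance, with the openai Python SDK, obtaining such a vector looks roughly like the sketch below. The specific model name is an assumption here; text-embedding-ada-002 is one OpenAI embedding model that returns 1536-dimensional vectors, and an API key is assumed to be configured.

```python
# Hedged sketch using the openai Python SDK (v1.x); text-embedding-ada-002 is
# assumed as the 1536-dimensional model referred to above.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Alice's Adventures in Wonderland",
)
vector = response.data[0].embedding
print(len(vector))  # 1536
```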

Implementation of RAG w/ Vector Databases

Getting back to retrieval-augmented generation, vector databases can be used to augment language models by retrieving relevant context at runtime. Company-specific unstructured data, such as product documentation, emails, Slack messages, etc., can be run through a text embedding model to obtain vector representations, which are then stored in a vector database and used as an external knowledge base. When a user prompts the language model, the query is first sent to the vector database to identify the most relevant context to append to the original prompt. This augmented prompt, containing both the original question and the relevant background information, is then fed into the LLM to generate a response grounded in real-world data.

To illustrate this, take the example of a B2B software company building a chatbot to assist customers in implementing its product (a minimal code sketch of the full flow follows the numbered steps).

  1. Vectorize Data: Initially, the company takes its dataset of product documentation and splits it into chunks. The chunks of text are then run through an embedding model, and the resulting vectors are stored in a dedicated vector database, where they are organized and indexed for efficient retrieval.
  2. User Query: When a user interacts with the LLM-assisted chatbot by posing a query (e.g., “How do I troubleshoot network connectivity issues?”), the query is first run through the same embedding model used to index the documentation, producing a numerical query vector.
  3. Similarity Search: Using the query vector in a similarity search against the database, the nearest stored vectors are retrieved. If done correctly, these vectors correspond to segments of the documentation that explain network connectivity or other related topics.
  4. Contextualization: The text chunks associated with the retrieved vectors are then concatenated with the original prompt to form a comprehensive query, which is subsequently used to prompt the LLM.
  5. Natural Language Response: The language model can now synthesize this information to formulate a coherent and context-aware response.
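As promised above, here is a minimal end-to-end sketch of those five steps in Python. The `embed` and `call_llm` functions are hypothetical stand-ins (the toy random "embeddings" are not actually semantic), and a plain NumPy array plays the role of the vector database; the point is the shape of the pipeline, not the components.

```python
# End-to-end sketch of the five steps above. `embed` and `call_llm` are
# hypothetical stand-ins for a real embedding model and a real LLM API, and a
# plain NumPy array stands in for the vector database.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)  # tiny dimension for illustration

def call_llm(prompt: str) -> str:
    # Hypothetical: a real system would call an LLM API here.
    return f"<grounded answer based on a {len(prompt)}-character prompt>"

# 1. Vectorize data: chunk the documentation and store the embeddings.
doc_chunks = [
    "To troubleshoot network connectivity, first check the device's firewall settings.",
    "API keys can be rotated from the admin console.",
    "Webhooks retry failed deliveries up to five times with exponential backoff.",
]
index = np.stack([embed(chunk) for chunk in doc_chunks])  # stand-in for a vector database

def answer(question: str, k: int = 1) -> str:
    # 2. User query: embed the question with the same model.
    query_vec = embed(question)
    # 3. Similarity search: cosine similarity against every stored vector.
    sims = index @ query_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(sims)[::-1][:k]
    # 4. Contextualization: concatenate the retrieved chunks with the original prompt.
    context = "\n".join(doc_chunks[i] for i in top)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    # 5. Natural language response.
    return call_llm(prompt)

print(answer("How do I troubleshoot network connectivity issues?"))
```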

Final Thoughts

Retrieval-augmented generation represents a significant step in bridging the gap between traditional computing and the probabilistic nature of LLMs. RAG can significantly improve the effectiveness of language models by allowing them to tap into vast amounts of real-time enterprise data while focusing their attention on only the relevant context. This helps mitigate hallucinations, enhance accuracy, and introduce traceability to LLM responses.

In the future, generative models will be used throughout the enterprise to facilitate content generation, enhance enterprise search capabilities, and infuse a layer of intelligence and personalization into products. While challenges remain, RAG stands as a promising early solution that can instill trust and reliability into the generative capabilities of LLMs.
