Retrieval Augmented Generation with PgVector and Ollama

Building a Knowledge Base chat app with HuggingFace Transformers, LangChainJS and Ollama

Seeu Sim Ong
17 min read · Jan 14, 2024
My application UI

While hosted models such as OpenAI's ChatGPT and Anthropic's Claude may seem like the go-to, they are closed source, and subscriptions are needed to access their advanced features, such as document chat. Likewise, many other tools online, which simply wrap API calls to OpenAI or other providers, also require subscriptions. In this article, I'll discuss how I attempted to mimic some of their functionality with models running locally (on your own laptop/machine) using Ollama.

I hope that this de-mystifies the process of chatting with documents and also the prompting and retrieval techniques that you can leverage to do so.

Ollama

Ollama is a desktop application that streamlines pulling and running open source large language models on your local machine. With its Command Line Interface (CLI), you can chat directly with models in your terminal:

Running Ollama in the terminal

Or you can fetch streaming responses via its REST API:

Interacting with Ollama via REST

Ollama simplifies interacting with models such as Llama 2 (7B through 70B), Mistral 7B, and many more. The only requirement is that your machine (or its graphics card) has enough memory to load the model weights.

For a 7B param model, Ollama recommends 8GB of RAM, while a 70B param model may require 64GB of RAM and may respond with noticeably higher latency.

Also, it is worth noting that the REST API Ollama exposes has a Cross-Origin Resource Sharing (CORS) policy that only accepts requests from localhost by default. To allow requests from external IP addresses or Docker containers, you will need to keep the Ollama application running in one terminal window, and run ollama serve in another with the OLLAMA_ORIGINS variable set to your application's origin, and the OLLAMA_HOST variable set to a port that does not conflict with the one already provisioned by Ollama (http://localhost:11434).
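For example, you might start a second instance like this (the origin and port here are placeholders for your own setup):

# Serve Ollama on port 11435, accepting requests from an app at http://192.168.1.10:3000
OLLAMA_HOST=0.0.0.0:11435 OLLAMA_ORIGINS=http://192.168.1.10:3000 ollama serve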

Pulling Model Weights

Before interacting with the model, the user will need to run the command:

# i.e. ollama pull mistral
ollama pull <MODEL_NAME>

This will pull the weights to your local machine for Ollama to run the inference. Do note that if you are running Ollama via Docker, you will have to pull the weights into a Docker Volume attached to the Docker container.
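For example, following Ollama's Docker instructions, you can mount a named volume for the weights and pull the model inside the container (the container and volume names here are illustrative):

# Run Ollama in Docker, persisting weights to a named volume
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull the weights inside the running container (they land in the volume)
docker exec -it ollama ollama pull mistral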

For our application, we first pull the Mistral model with the command ollama pull mistral.

Building the Retrieval Augmented Generation (RAG) Interface in a Javascript Web Application

When interfacing with such models, most examples you will find online are Jupyter Notebooks or Python scripts, neither of which can run in the browser. Furthermore, most of these examples merely call existing APIs and pass data to in-memory variables or external services, which does little to clarify how custom, production-ready deployments should be built.

In this article, I will demonstrate how I developed a RAG solution that uses Langchain.JS to interface with models on Ollama within web application code, as well as persist the data to disk with PostgreSQL and Docker.

For brevity, I will skip all the install steps for Node packages. For a full reference of what I used, you can refer to my package.json file here.

Retrieval

To enable the retrieval in Retrieval Augmented Generation, we will need 3 things:

  1. Generating Embeddings
  2. Storing and retrieving them (with Postgres)
  3. Chunking and Embedding documents

1. Generating Embeddings

Embeddings are vector representations of text content that are created by Transformer-based models. These models are trained on text and can convert text to number outputs represented as vectors.

Using similarity search between vectors, we can query for documents with a text query without needing to match the document's wording exactly (unlike traditional full-text search). This is done by converting the documents to embedding vectors, then converting the query to a vector and performing a vector search based on the distance between the query vector and the document vectors (cosine distance, Euclidean distance, etc.), returning the documents closest to the query vector.

Since the document and query vectors are generated by the same embedding model, the returned documents are likely to be close to the query in meaning, or to contain similar terms.
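To make this concrete, here is a minimal, self-contained sketch (not part of the app itself) that embeds a document and a query with the same Transformers.JS-backed LangchainJS class we use later, and compares them with cosine similarity:

import { HuggingFaceTransformersEmbeddings } from '@langchain/community/embeddings/hf_transformers';

// Cosine similarity: closer to 1 means the two vectors point in a more similar direction.
const cosineSimilarity = (a: number[], b: number[]) => {
  const dot = a.reduce((sum, value, index) => sum + value * b[index], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((sum, value) => sum + value * value, 0));
  return dot / (norm(a) * norm(b));
};

const demo = async () => {
  const embeddings = new HuggingFaceTransformersEmbeddings({
    modelName: 'Xenova/all-MiniLM-L6-v2', // outputs 384-dimensional vectors
  });

  const [docVector] = await embeddings.embedDocuments([
    'Llamas are members of the camelid family.',
  ]);
  const queryVector = await embeddings.embedQuery('Which family do llamas belong to?');

  // A higher score means the document is more relevant to the query.
  console.log(cosineSimilarity(docVector, queryVector));
};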

Setting it up in Next.JS 14 on the server

In my app, I opted to generate the embeddings on the server to minimise load times for the client. This meant that the server could fetch and cache the ONNX embedding model binaries from Huggingface and run them in the app with the ONNX Node runtime. (ONNX is a separate model weights format that enables inference without framework code such as PyTorch.)

To start, Next.JS uses Webpack to bundle source code files. We do not want the ONNX runtime and model files to be processed by Webpack, as they include binary .node files. Hence, we configure our Next.JS configuration file like so:

// next.config.js

/** @type {import('next').NextConfig} */
const nextConfig = {
  // ... All your other configuration
  experimental: {
    serverComponentsExternalPackages: [
      'sharp',
      'onnxruntime-node', // Important
    ],
  },
};

module.exports = nextConfig;

We add the experimental serverComponentsExternalPackages configuration to define the packages used to run the embedding model within our app, telling Next.JS not to bundle the code and binaries. Then, we can freely run and cache the embedding model binaries in our server-side code with no issues.

Initialising the Embeddings Model

We use LangchainJS' HuggingFaceTransformersEmbeddings class, which wraps the @xenova/transformers (or Transformers.JS) pipeline for performing inference with Huggingface models in applications.

In using Transformers.JS, we will need models which support the ONNX runtime, as Transformers.JS utilises the ONNX runtime for Node. Most of the embeddings models from the official Transformers.JS model repository on Huggingface will work with this code.

To start, we use the model Xenova/all-MiniLM-L6-v2 which outputs vectors of 384 dimensions. This strikes a good balance between quality and inference speed. We have also tested Xenova/gte-base (which is an ONNX port of thenlper/gte-base, ranked #12 on the Massive Text Embedding Benchmark) with slightly slower inference but better retrieval.

We also define a singleton and attach it to the global object to preserve it between hot-reloads of the Next server:

import { HuggingFaceTransformersEmbeddings } from '@langchain/community/embeddings/hf_transformers';

const getHuggingFaceEmbeddings = () =>
  class HuggingFaceEmbeddingSingleton {
    // Choose this instead if your machine can run it on large documents
    // static model = 'Xenova/gte-base'; // Output: 768 dimensions
    static model = 'Xenova/all-MiniLM-L6-v2'; // Output: 384 dimensions

    static instance: HuggingFaceTransformersEmbeddings | null = null;

    static async getInstance() {
      if (this.instance === null) {
        this.instance = new HuggingFaceTransformersEmbeddings({
          modelName: this.model,
        });
      }
      return this.instance;
    }
  };

export type THuggingFaceEmbeddingSingleton = ReturnType<typeof getHuggingFaceEmbeddings>;

let HuggingFaceEmbeddingSingleton: THuggingFaceEmbeddingSingleton;

if (process.env.NODE_ENV !== 'production') {
  if (!global.HuggingFaceEmbeddingSingleton) {
    global.HuggingFaceEmbeddingSingleton = getHuggingFaceEmbeddings();
  }
  HuggingFaceEmbeddingSingleton = global.HuggingFaceEmbeddingSingleton;
} else {
  HuggingFaceEmbeddingSingleton = getHuggingFaceEmbeddings();
}

export default HuggingFaceEmbeddingSingleton;

Now, when we import this instance within our application code, we can generate embeddings using this one singleton, ensuring it is only loaded once.
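For instance, a minimal usage sketch in any server-side module:

import HuggingFaceEmbeddingSingleton from '@/lib/models/embeddings/huggingfaceEmbeddings';

// The first call loads the ONNX weights; later calls reuse the cached instance.
const embeddings = await HuggingFaceEmbeddingSingleton.getInstance();

// A number[] of length 384 for all-MiniLM-L6-v2 (768 for gte-base).
const queryVector = await embeddings.embedQuery('What is Retrieval Augmented Generation?');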

For those using TypeScript, you may wish to edit your environment.d.ts file to add the typings on the global object:

// environment.d.ts

// where I stored my singleton, yours may vary
import type { THuggingFaceEmbeddingSingleton } from '@/lib/models/embeddings/huggingfaceEmbeddings';

declare global {
  namespace globalThis {
    var HuggingFaceEmbeddingSingleton: THuggingFaceEmbeddingSingleton | undefined;
  }
}

export {};

2. Storing and Retrieving the Embeddings

For our Postgres store to receive the documents and embed them, we will need to first configure Postgres.

To perform this, you will first need to install Docker.

Configuring Postgres

For my local development, I used Docker to simplify the setup, and here is the Dockerfile:

# Dockerfile

FROM ankane/pgvector

COPY *.sql /docker-entrypoint-initdb.d/

This uses the official PGVector image to enable the storing of vectors to our Postgres database. We also include an init.sql file to initialise the PGVector extension:

-- init.sql

CREATE EXTENSION IF NOT EXISTS vector;

And we run it with docker-compose with the following config file:

# docker-compose.yaml

services:
  postgres:
    hostname: postgres
    build:
      context: .
      dockerfile: ./.local/cluster/postgres.Dockerfile
    image: "llmchat-db"
    container_name: "llmchat_db"
    environment:
      POSTGRES_DB: "llmchat"
      POSTGRES_USER: "locallm"
      POSTGRES_PASSWORD: "locallm"
      PGDATA: "/data/llmchat-db"
    volumes:
      - "llmchat-db-docker:/data/llmchat-db"

    # To access via localhost, and prevent conflicts with our local Postgres
    ports:
      - "5431:5432"
    restart: unless-stopped

volumes:
  llmchat-db-docker:
    external: true

By running these commands, we can initialise and start the database:

# Initialise the volume (first time only)
docker volume create llmchat-db-docker

# Start the cluster
docker compose up -d

Migrating the Schema

I used this SQL code to generate the schema for the embeddings (LangchainJS expects this):

CREATE TABLE IF NOT EXISTS "embeddings" (
  "id" uuid PRIMARY KEY DEFAULT gen_random_uuid() NOT NULL,
  "created_time" timestamp DEFAULT now(),
  "content" text,
  "metadata" jsonb,

  -- change this to 768 if using gte-base
  "embedding" vector(384),
  CONSTRAINT "embeddings_id_unique" UNIQUE("id")
);

I used drizzle as my Object Relational Mapper (ORM) for TypeScript, but you are free to choose another.
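As a rough sketch of what that table might look like with drizzle (at the time of writing drizzle had no built-in pgvector column type, so we define one with customType; adapt freely to your own setup):

import { customType, jsonb, pgTable, text, timestamp, uuid } from 'drizzle-orm/pg-core';

// A custom column type for pgvector, since drizzle does not ship one.
const vector = (name: string, dimensions: number) =>
  customType<{ data: number[] }>({
    dataType() {
      return `vector(${dimensions})`;
    },
  })(name);

export const embeddings = pgTable('embeddings', {
  id: uuid('id').defaultRandom().primaryKey(),
  createdTime: timestamp('created_time').defaultNow(),
  content: text('content'),
  metadata: jsonb('metadata'),
  embedding: vector('embedding', 384), // 768 if using gte-base
});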

Configuring the Vector Store for interfacing with LangchainJS

Now, we can configure our vector store, courtesy of LangchainJS. They provide a PGVectorStore class that abstracts the interactions with Langchain 'chains' when storing and retrieving vectors.
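The PGVectorStore.initialize call in the singleton below expects a configuration object, referenced here as PgVectorStoreConfig. A minimal sketch of it, assuming the connection details from the Docker Compose file above (adjust to your own setup), might look like this:

import type { PGVectorStoreArgs } from '@langchain/community/vectorstores/pgvector';

// Sketch only: match these values to your own docker-compose.yaml and schema.
export const PgVectorStoreConfig: PGVectorStoreArgs = {
  postgresConnectionOptions: {
    host: 'localhost',
    port: 5431, // the host port mapped in docker-compose.yaml
    user: 'locallm',
    password: 'locallm',
    database: 'llmchat',
  },
  tableName: 'embeddings',
  columns: {
    idColumnName: 'id',
    contentColumnName: 'content',
    metadataColumnName: 'metadata',
    vectorColumnName: 'embedding',
  },
};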

Likewise, we initialise this as a singleton to minimise application load:

// vectorStore.ts

import { PGVectorStore } from '@langchain/community/vectorstores/pgvector';

import HuggingFaceEmbeddingSingleton from './embeddings/huggingfaceEmbeddings';
// The Postgres connection/table config for the store (path assumed; see the sketch above).
import { PgVectorStoreConfig } from './config';

const getVectorStore = () =>
  class VectorStoreSingleton {
    static instance: PGVectorStore | null = null;
    static async getInstance() {
      if (this.instance === null) {
        const embeddings = await HuggingFaceEmbeddingSingleton.getInstance();
        this.instance = await PGVectorStore.initialize(embeddings, PgVectorStoreConfig);
        // Clean up the connection pool when the process exits.
        process.on('beforeExit', () => {
          this.instance?.end();
          this.instance = null;
        });
      }
      return this.instance;
    }
  };

export type TVectorStore = ReturnType<typeof getVectorStore>;

let VectorStore: TVectorStore;

if (process.env.NODE_ENV !== 'production') {
  if (!global.VectorStoreSingleton) {
    global.VectorStoreSingleton = getVectorStore();
  }
  VectorStore = global.VectorStoreSingleton;
} else {
  VectorStore = getVectorStore();
}

export default VectorStore;

3. Embedding Documents

Now, we can use Langchain within our application to chunk and embed our source documents.

We first try out the PDF workflow to embed our documents. LangchainJS provides a few utility classes, such as PDF Loaders and Text Splitters. In our code below, we implement 2 steps:

  1. A PDF processor to load PDF data from source file blobs
  2. A text chunker to chunk the PDF data into text chunks.

import type { Document } from 'langchain/document';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

type TMetadata = {
  title: string;
  fileType: string;
  totalPages: number;
  roomKeys: {
    [key: string]: boolean;
  };
};

export type TChunkMetadata = Omit<TMetadata, 'totalPages'> & { splitNumber: number };

type TProcessedDocument = Document<TMetadata>;

export type TProcessedChunk = Document<TChunkMetadata>;

/**
 *
 * @param files An array of `File` instances, one for each PDF file uploaded.
 * @param roomId The room to grant initial access to for these files.
 * @returns An array of `Document` instances, one for each PDF file. We will
 * generate one array of `Document` instances per `File` instance.
 */
export const processPDFFiles: (
  files: Array<File>,
  roomId: string
) => Promise<TProcessedDocument[][]> = async (files, roomId) => {
  return await Promise.all(
    files.map((file) =>
      new PDFLoader(file, { splitPages: file.size > 500_000 }).load().then((docs) => {
        return docs.map((doc) => ({
          ...doc,
          metadata: {
            title: file.name,
            fileType: file.type,
            totalPages: doc['metadata']['pdf']['totalPages'],
            roomKeys: {
              [roomId]: true,
            },
          },
        }));
      })
    )
  );
};

/**
 *
 * @param documents A nested array of `Document` instances
 * @returns A nested array of chunks, one for each source file,
 * with each chunk containing metadata about their source file.
 */
export const getTextChunks: (
  documents: Array<Array<Document>>
) => Promise<Array<Array<TProcessedChunk>>> = async (files) => {
  const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 50 });
  const splits = await Promise.all(
    files.map((pages) =>
      // 2D Array of splits
      Promise.all(
        pages.map((page) =>
          splitter.splitDocuments([page]).then((splits) =>
            splits.map((split) => ({
              ...split,
              metadata: {
                title: page.metadata.title,
                fileType: page.metadata.fileType,
                roomKeys: page.metadata.roomKeys,
              },
            }))
          )
        )
        // 1D Array of splits
      ).then((res) =>
        res
          .flatMap((pageSplits) => pageSplits)
          .map((split, splitIndex) => ({
            ...split,
            metadata: { ...split.metadata, splitNumber: splitIndex + 1 },
          }))
      )
    )
  );
  return splits;
};

Then, we can simply call the vector store we defined earlier to upload the documents. It will handle the embedding for us.

// We maintain a room bitmap for each file to grant access to rooms.
const documents = await processPDFFiles(files, roomId);

const chunks = await getTextChunks(documents);

// Insert into vector store.
const vectorStore = await VectorStore.getInstance();
const successfulInserts: string[] = [];
const errors: string[] = [];
for (const currentDocument of chunks) {
  if (currentDocument.length === 0) {
    continue;
  }
  const { title } = currentDocument[0].metadata;
  try {
    await vectorStore.addDocuments(currentDocument);
    successfulInserts.push(`${title}`);
  } catch (error) {
    errors.push(currentDocument[0].metadata.title);
  }
}

This is done sequentially to minimise load on our database. If your database can handle more load, you can offload this embedding and uploading process to other worker threads.

We define this in an API route for our local app as a demonstration, but you may wish to offload such processing to a Thread/Worker Pool for production applications for reduced latency.

This Thread/Worker Pool will include everything from the Embedding Model defined earlier to the file processing and uploading as defined above. Then, you just need to configure the thread pool to receive documents.

Augmenting and Generating

To implement the base functionality for our chat, we set up a few variables for the frontend to interface with our model:

  • An array to maintain the chat history
  • Reusable interfaces for our Chat APIs to use
  • Prompt templates

Setting Up the Ollama interface

Like earlier when we set up the Vector Store, we will now need to set up another Singleton to interface with Ollama.

Here’s the code we used to do so:

import { ChatOllama } from '@langchain/community/chat_models/ollama';

const getOllamaSingleton = () =>
  class OllamaSingleton {
    static model = 'mistral';
    static instance: ChatOllama | null = null;
    static async getInstance() {
      if (this.instance === null) {
        this.instance = new ChatOllama({
          baseUrl: process.env.OLLAMA_BASE_URL,
          model: this.model,
          numCtx: 32768, // Mistral's full context window
        });
      }
      return this.instance;
    }
  };

export type TChatOllamaSingleton = ReturnType<typeof getOllamaSingleton>;

let ChatOllamaSingleton: TChatOllamaSingleton;
if (process.env.NODE_ENV !== 'production') {
  if (!global.ChatOllamaSingleton) {
    global.ChatOllamaSingleton = getOllamaSingleton();
  }
  ChatOllamaSingleton = global.ChatOllamaSingleton;
} else {
  ChatOllamaSingleton = getOllamaSingleton();
}
export default ChatOllamaSingleton;

Setting Up the Prompts and Response Streams

To chat effectively with the model, we can use the prompt templates that our chosen model, Mistral, has been fine-tuned with:

  • It uses <s> / </s> tokens to signal the start/end of exchanges with a user.
  • It uses [INST] / [/INST] tokens to signal the start/end of user instructions.

Then, we may feed in different prompt templates according to the chat context:

  1. If the user is sending their first message, and no document has been uploaded.
  2. If the user is sending a message in a chat that has no documents.
  3. If the user is sending their first message with document(s).
  4. If the user is sending a message in a chat with document(s).

Scenario #1 — Initial Question

[INST]
Tell me a joke about llamas.
[/INST]

To represent the first exchange with the model, we simply wrap our instruction with this format.

To generate a stream that we can return to our frontend, we can use this code:

import { PromptTemplate } from 'langchain/prompts';
import { StringOutputParser } from 'langchain/schema/output_parser';

// The Ollama singleton we defined earlier; adjust the path to where you store it.
import ChatOllamaSingleton from '@/lib/models/chat/chatOllama';

const baseTemplate = `
[INST]
{question}
[/INST]`.trim();

/**
 * Given an initial question, returns the model's answer as a stream.
 *
 * @param question The question asked by the user
 * @returns The stream from the model
 */
const getAnswerStream = async (question: string) => {
  const prompt = PromptTemplate.fromTemplate(baseTemplate);

  const ollama = await ChatOllamaSingleton.getInstance();

  const chain = prompt.pipe(ollama).pipe(new StringOutputParser());

  return chain.stream({ question });
};

Scenario #2 — Chat History with no Document

<s>
[INST]
Tell me a joke about llamas.
[/INST]
Why don't llamas like rainy weather?
Because it makes them really llama-ted! (too wet)
</s>
[INST]
Now tell me one about chickens.
[/INST]

For a scenario where the user has sent and received multiple messages from the model, we simply wrap each exchange from the chat history in the format shown above, and prepend them all to our latest instruction, like so:

<s>
/if
[INST]
{USER PROMPT #1}
[/INST]
/endif
/if
{SYSTEM RESPONSE #1}
/endif
</s>
...
<s>
/if
[INST]
{USER PROMPT #n}
[/INST]
/endif
/if
{SYSTEM RESPONSE #n}
/endif
</s>
[INST]
{LATEST USER PROMPT}
[/INST]

Mistral will then recognise this flow as a conversation, without any additional prompting.

Do note that due to Mistral's limited context window of 32,768 tokens, responses may degrade with a longer chat history. You may wish to summarise long sequences with additional prompts when working with smaller models.

The code does not differ much from Scenario #1. To adapt it, simply add a parameter for your chat history, and pre-process that history into the format shown above (a rough sketch follows below).

In edge cases where certain responses are not paired (i.e. standalone questions, or standalone system responses), you may remove the user prompt or system prompt portions accordingly.
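As a rough sketch of Scenario #2 (reusing the imports from Scenario #1, the TChatMessage shape used later in Scenario #4, and assuming a strictly alternating user/system history):

type TChatMessage = {
  persona: string;
  content: string;
};

// Wrap each past (user, system) exchange in <s>[INST]...[/INST]...</s>.
// Assumes the history strictly alternates: user message, then system response.
const formatHistory = (messages: Array<TChatMessage>) => {
  const exchanges: Array<string> = [];
  for (let i = 0; i < messages.length; i += 2) {
    const user = messages[i];
    const system = messages[i + 1];
    exchanges.push(`<s>[INST]${user.content}[/INST]${system?.content ?? ''}</s>`);
  }
  return exchanges.join('\n');
};

const chatTemplate = `
{history}
[INST]
{question}
[/INST]`.trim();

const getChatAnswerStream = async (question: string, history: Array<TChatMessage>) => {
  const prompt = PromptTemplate.fromTemplate(chatTemplate);
  const ollama = await ChatOllamaSingleton.getInstance();
  const chain = prompt.pipe(ollama).pipe(new StringOutputParser());
  return chain.stream({ question, history: formatHistory(history) });
};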

Scenario #3 — Initial Question from Documents

This is where we need a separate workflow to retrieve the documents from the Vector Store, and embed them within our prompt.

The base summary of the prompt would look something like this:

<s>
You are an experienced researcher, expert at interpreting
and answering questions based on provided sources. Using the
provided context, answer the user's question to the best of your
ability using the resources provided.
Anything between the following \`context\` html blocks is retrieved from a
knowledge bank, not part of the conversation with the user.
<context>
{context}
<context/>
REMEMBER: If there is no relevant information within the context,
just say "Hmm, I'm not sure." Don't try to make up an answer.
</s>
<s>
[INST]
{question}
[/INST]
</s>

Then, we use the vector store defined earlier to query for relevant documents based on the question, and format them as a multi-line string to insert into the {context} placeholder.

To do this, we can use LangChain Chains. Here’s how we did this for this scenario:

import { RunnableSequence } from 'langchain/schema/runnable';
import { StringOutputParser } from 'langchain/schema/output_parser';
import { PromptTemplate } from 'langchain/prompts';
import type { Document } from 'langchain/document';

// Where I store my utility interfaces, yours may differ
import ChatOllamaSingleton from '@/lib/models/chat/chatOllama';
import VectorStore from '@/lib/models/vectorStore';

// Copy in the prompt from above
const baseDocumentQATemplate = `
...
`;

// Given a list of results, format it into the prompt.
const formatDocsAsString = (docs: Document[]) => {
  return docs
    .map((document, index) => `<doc id='${index}'>${document.pageContent}</doc>`)
    .join('\r\n');
};

/**
 * Given a user's question and the ID of their chat room,
 * retrieves the relevant documents and the model's response
 * as a text stream.
 *
 * @param question The question to be asked
 * @param roomId The chat room to filter documents for.
 * @returns A text stream to stream the model's response;
 */
const getBaseQAStream = async (question: string, roomId: string) => {
  const vectorstore = await VectorStore.getInstance();
  const retriever = vectorstore.asRetriever({
    filter: {
      roomKeys: {
        [roomId]: true,
      },
    },
  });

  const retrievalChain = RunnableSequence.from([
    (input: { question: string }) => input.question,
    retriever,
    formatDocsAsString,
  ]);

  const model = await ChatOllamaSingleton.getInstance();

  const fullChain = RunnableSequence.from([
    {
      question: (input: { question: string }) => input.question,
      context: RunnableSequence.from([
        (input: { question: string }) => ({ question: input.question }),
        retrievalChain,
      ]),
    },
    RunnableSequence.from([
      PromptTemplate.fromTemplate(baseDocumentQATemplate),
      model,
      new StringOutputParser(),
    ]),
  ]);

  return fullChain.stream({ question });
};

By calling fullChain.stream({ question: '<your question>' }), you get a text stream of the model's output, which can then be piped to the frontend.

Scenario #4: Chat History, with Documents

This is where it gets interesting. As the natural flow of the conversation goes, the user may ask questions while referring to earlier parts of the conversation, leading to questions such as “Then give me some examples demonstrating this.”.

If we were to pass such questions directly to the vector store, then we wouldn’t get many relevant documents. This is because the vector store does not know what “this” means.

Hence, we need an additional chain that takes in the chat history and the follow-up question, and asks our model to rephrase it as a standalone question that can be used to query the store:

export const chatHistoryReflectTemplate = `[INST]
Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}

Follow Up Input: {question}

Standalone Question:
[/INST]`;

//... All our previous variables

const retrievalChain = RunnableSequence.from([
  PromptTemplate.fromTemplate(chatHistoryReflectTemplate),
  model,
  new StringOutputParser(),
  retriever,
  formatDocsAsString,
]);

Within retrievalChain, the standalone question produced by the model is used to query the vector store, and the formatted documents it returns fill the {context} placeholder of the Question-Answer template from Scenario #3.

To get the full stream, we can amend the function from Scenario 3 to add the retrieval chain as an initial step:

// ... All imports and variables from earlier

type TChatMessage = {
  persona: string;
  content: string;
};

const formatChatHistory = (messages: Array<TChatMessage>) => {
  if (messages.length === 0) {
    return '';
  }
  if (messages.length === 1) {
    const message = messages[0];
    if (message.persona === 'user') {
      return `<s>[INST]${message.content}[/INST]</s>`;
    }
    return `<s>${message.content}</s>`;
  }
  const [first, second, ...rest] = messages;

  if (first.persona === 'user' && second.persona === 'system') {
    return `<s>[INST]${first.content}[/INST]${second.content}</s>`
      + formatChatHistory(rest);
  }
  return formatChatHistory([first]) + formatChatHistory([second, ...rest]);
};

const documentQATemplate = `
You are an experienced researcher, expert at interpreting
and answering questions based on provided sources. Using the
provided context, answer the user's question to the best of your
ability using the resources provided.
Anything between the following \`context\` html blocks is retrieved from a
knowledge bank, not part of the conversation with the user.
<context>
{context}
<context/>
REMEMBER: If there is no relevant information within the context,
just say "Hmm, I'm not sure." Don't try to make up an answer.
`;

// ... other variables and imports defined earlier
const chatDocumentQATemplate = `<s>
${documentQATemplate}
</s>
{history}
[INST]
{question}
[/INST]`;

const fullChain = RunnableSequence.from([
  {
    question: (input: { question: string; chat_history: Array<TChatMessage> }) =>
      input.question,
    history: RunnableSequence.from([
      (input: { question: string; chat_history: Array<TChatMessage> }) =>
        input.chat_history,
      formatChatHistory,
    ]),
    context: RunnableSequence.from([
      (input: { question: string; chat_history: Array<TChatMessage> }) => {
        return {
          question: input.question,
          chat_history: input.chat_history
            .map(
              (message: TChatMessage) =>
                `${message.persona?.toUpperCase()}: ${message.content}`
            )
            .join('\r\n'),
        };
      },
      retrievalChain,
    ]),
  },
  RunnableSequence.from([
    PromptTemplate.fromTemplate(chatDocumentQATemplate),
    model,
    new StringOutputParser(),
  ]),
]);

const stream = (question: string, history: Array<TChatMessage>) =>
  fullChain.stream({ question, chat_history: history });

This chain does three things:

  1. Formats the chat history for the retrieval chain, which produces the standalone question.
  2. Uses the standalone question to retrieve relevant documents from the vector store.
  3. Formats the context, the chat history (in the instruction format above), and the final question into the final prompt.

Now, you can easily compose all of the intermediate steps, retrievals, and model calls into one chain, fullChain, via LangChain's Chain API. As in the earlier scenarios, we call fullChain.stream with our question and chat history to return the text stream.

Integration into Application Code

Now, given all the above functionality, we have all the APIs and tools to call our Mistral model running on Ollama.

We simply need to expose the relevant functions/streams to our frontend in Next.JS to accept the user's input and stream back the response.

In our REST API route handlers, we can pipe the responses through to the frontend by wrapping the stream in a Response:

const stream = getStream(question);
return new Response(stream);
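A slightly fuller sketch of such a route handler in the Next.JS App Router (the route path and request shape are assumptions), encoding the chain's string chunks into bytes for the HTTP response:

// app/api/chat/route.ts (illustrative)
// Assumes getAnswerStream from Scenario #1 is exported and imported here.
export async function POST(request: Request) {
  const { question } = await request.json();
  const answerStream = await getAnswerStream(question);

  // The chain yields string chunks; encode them for the response body.
  const encoder = new TextEncoder();
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of answerStream) {
        controller.enqueue(encoder.encode(chunk));
      }
      controller.close();
    },
  });

  return new Response(body, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}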

And we can then query it from our frontend:

const response = await fetch('/api-route', {
  // ... other options
});

const stream = response?.body?.getReader();
const decoder = new TextDecoder();
let isStreamFinished = false;

while (!isStreamFinished && stream) {
  const { done, value } = await stream.read();
  if (done) {
    isStreamFinished = true;
    break;
  }
  const chunk = decoder.decode(value);
  // ... other logic for using the chunk
}

You’ll also notice in the earlier segments that I did not use the RetrievalQAChain from LangChain to do so, because I wanted to customise the way messages are inserted into the database within my backend.

For this, I provisioned additional tables, schema and backend logic to persist my messages, and also allowed for my documents to be queried by the different chat rooms to which they are linked, via this schema:

type Document = {
  // ... all the relevant fields
  metadata: {
    title: string;
    fileType: string;
    splitNumber: number;
    roomKeys: {
      [key: string]: boolean;
    };
  };
};

Then, to check if a document is linked to a room, we check whether its roomKeys property contains that chat room's ID as a key.

We can also use SQL queries to retrieve all the messages for a room to display in the UI, or to retrieve all documents linked to a room.
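For example, a small SQL sketch of the room check, using Postgres' jsonb key-existence operator (?) on the stored metadata (the room ID is a placeholder):

-- Fetch all document chunks linked to a given room
SELECT "content", "metadata"
FROM "embeddings"
WHERE "metadata"->'roomKeys' ? 'YOUR_ROOM_ID';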

Conclusion

This app is a fun experiment to demonstrate working with LangChain and HuggingFace Transformers in Next.JS, or in JavaScript apps in general. I hope you have found it useful for implementing your own web apps or backends that interface not just with Ollama via LangChain, but also with other LangChain chat models that expose the same stream interface.

For more, you can view the source code here: https://github.com/SeeuSim/local_llm_chat
