Local LLM in the Browser Powered by Ollama

Andrew Nguonly
7 min read · Nov 22, 2023

Over the last week, I spent some extra cycles tinkering with different ideas on how to insert LLM capability into the browser. After a couple of starts and stops, I landed on an implementation that felt like a decent proof of concept while still being functionally useful.

Lumos 🪄

Lumos.

Lumos is a Chrome extension that answers any question or completes any prompt based on the content of the current tab in your browser. It’s powered by Ollama, a platform for running LLMs locally on your machine. When you prompt Lumos, data never leaves your computer: inference happens locally, without the support of an external LLM provider (e.g. OpenAI), so your questions stay private and you don’t pay for API usage. A local LLM is key to broadening the adoption of language models in the browser.

I experimented with Chromium and prototyped other designs, but eventually concluded that a Chrome extension strikes a fair balance between ease of use, ease of development, and functional capability. It meets all the criteria for serving practical use cases and is still easy enough to develop for anyone with basic knowledge of React and JavaScript.

Screenshot of Lumos Chrome extension

The approach is by no means perfect. However, it demonstrates the utility and potential of LLMs in the browser.

Why LLM in the Browser? 🤔

Simply put, having an LLM embedded in the browser has much better ergonomics than repeatedly copying and pasting content from one tab to your ChatGPT window. Why not just summarize the article right in the current tab?

Moreover, because content on the internet changes by the second, language models are out of date as soon as the next post is made. A basic RAG (retrieval-augmented generation) LLM architecture is implemented in Lumos so you can…

  • summarize long threads on issue tracking sites, forums, and social media pages
  • extract highlights from breaking news articles or capture the essence of long-form opinion pieces
  • ask questions about restaurant and product reviews
  • condense technical documentation for ease of consumption

These are just some of the use cases that are enabled by having an LLM conveniently embedded in the browser. Over time, the feature is sure to unlock more unforeseen value.

The Basics 📓

The core implementation of Lumos is simple. A script is injected into the current tab to retrieve content on the page. The content is passed along with the user’s prompt to the extension’s background script for processing. The entire RAG LLM workflow is executed in the background and the completion response is forwarded to the extension’s main thread for rendering.

Ollama 🦙

Ollama is a platform for running LLMs locally. Specifically, Lumos relies on the Ollama REST API. The extension calls the API to generate embeddings (POST /api/embeddings) and perform inference (POST /api/generate). Download the installer, install the CLI, and run the command:

OLLAMA_ORIGINS=chrome-extension://* ollama serve

The environment variable OLLAMA_ORIGINS must be set to chrome-extension://* to bypass CORS security features in the browser.

user@machinename % OLLAMA_ORIGINS=chrome-extension://* ollama serve
2023/11/22 09:25:43 images.go:799: total blobs: 6
2023/11/22 09:25:43 images.go:806: total unused blobs removed: 0
2023/11/22 09:25:43 routes.go:777: Listening on 127.0.0.1:11434 (version 0.1.10)

The local server is hosted on port 11434 by default.
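
Before wiring up the extension, it can be helpful to confirm the server is reachable. The snippet below is a minimal sanity check against the generate endpoint; it assumes the llama2 model has already been pulled (ollama pull llama2) and that your Ollama build supports the stream: false option (older builds stream newline-delimited JSON instead).

// minimal sanity check against the local Ollama REST API
// assumes `ollama pull llama2` has been run and this build supports `stream: false`
fetch("http://localhost:11434/api/generate", {
  method: "POST",
  body: JSON.stringify({
    model: "llama2",
    prompt: "Why is the sky blue?",
    stream: false, // return a single JSON object instead of a stream of JSON lines
  }),
})
  .then((response) => response.json())
  .then((data) => console.log(data.response))
  .catch((error) => console.log(`Error: ${error}`));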

Script Injection 💉

To inject a script into the current tab, make sure that the activeTab and scripting permissions are specified in the extension’s manifest file.

{
  "manifest_version": 3,
  ...
  "permissions": [
    "activeTab",
    "scripting"
  ],
  ...
}

The following code demonstrates how to inject a script and process the retrieved content.

// script to be injected
const htmlToString = (selector: any) => {
  if (selector) {
    selector = document.querySelector(selector);
    if (!selector) return "";
  } else {
    selector = document.documentElement;
  }

  // strip HTML tags
  const parser = new DOMParser();
  const doc = parser.parseFromString(selector.outerHTML, "text/html");
  var textContent = doc.body.innerText || "";

  // use a regular expression to replace contiguous white spaces with a single space
  textContent = textContent.replace(/\s+/g, " ");

  return textContent.trim();
};

// script injection
chrome.tabs.query({ active: true, currentWindow: true }).then((tabs) => {
  var activeTab = tabs[0];
  var activeTabId = activeTab.id;

  return chrome.scripting.executeScript({
    // @ts-ignore
    target: { tabId: activeTabId },
    injectImmediately: true,
    func: htmlToString,
    args: ["body"]
  });
}).then(async (results) => {
  const pageContent = results[0].result;
  // process the page content here
}).catch((error) => {
  console.log(`Error: ${error}`);
});
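
The "process the page content here" comment is where the popup hands the extracted text off to the background script. The sketch below shows one possible shape of that hand-off, using hypothetical names (userPrompt, setAnswer) but matching the message shapes handled by the background script shown in the next section: { context }, { prompt }, and the { answer } it sends back.

// sketch of the popup-to-background hand-off (hypothetical names)
// send the extracted page content, then the user's prompt, to the background script
chrome.runtime.sendMessage({ context: pageContent });
chrome.runtime.sendMessage({ prompt: userPrompt });

// listen for the completion forwarded back from the background script
chrome.runtime.onMessage.addListener((message) => {
  if (message.answer) {
    setAnswer(message.answer); // e.g. a React state setter in the popup component
  }
});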

Background RAG

The extension’s background script performs all the magic. A RAG LLM chain, implemented with LangChain, is invoked for each prompt. LangChain’s JavaScript framework provides an interface to Ollama and an in-memory vectorstore implementation. The library can be incorporated easily into any Chrome extension. The following code is the entire background.ts file for Lumos (full source linked below).

// background.ts
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { PromptTemplate } from "langchain/prompts";
import { Ollama } from "langchain/llms/ollama";
import { OllamaEmbeddings } from "langchain/embeddings/ollama";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { StringOutputParser } from "langchain/schema/output_parser";
import { RunnableSequence, RunnablePassthrough } from "langchain/schema/runnable";
import { formatDocumentsAsString } from "langchain/util/document";

const OLLAMA_BASE_URL = "http://localhost:11434";
const OLLAMA_MODEL = "llama2";
var context = "";

chrome.runtime.onMessage.addListener(async function (request) {
  if (request.prompt) {
    var prompt = request.prompt;
    console.log(`Received prompt: ${prompt}`);

    // create model
    const model = new Ollama({ baseUrl: OLLAMA_BASE_URL, model: OLLAMA_MODEL });

    // create prompt template
    const template = `Use only the following context when answering the question. Don't use any other knowledge.\n\nBEGIN CONTEXT\n\n{filtered_context}\n\nEND CONTEXT\n\nQuestion: {question}\n\nAnswer: `;
    const formatted_prompt = new PromptTemplate({
      inputVariables: ["filtered_context", "question"],
      template,
    });

    // split page content into overlapping documents
    const splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 500,
      chunkOverlap: 0,
    });
    const documents = await splitter.createDocuments([context]);

    // load documents into vector store
    const vectorStore = await MemoryVectorStore.fromDocuments(
      documents,
      new OllamaEmbeddings({
        baseUrl: OLLAMA_BASE_URL,
        model: OLLAMA_MODEL,
      }),
    );
    const retriever = vectorStore.asRetriever();

    // create chain
    const chain = RunnableSequence.from([
      {
        filtered_context: retriever.pipe(formatDocumentsAsString),
        question: new RunnablePassthrough(),
      },
      formatted_prompt,
      model,
      new StringOutputParser(),
    ]);

    // invoke chain and return response
    const result = await chain.invoke(prompt);
    chrome.runtime.sendMessage({ answer: result });
  }
  if (request.context) {
    context = request.context;
    console.log(`Received context: ${context}`);
  }
});

Initializing the Ollama model and OllamaEmbeddings abstractions requires specifying the Ollama REST API base URL (http://localhost:11434) and model name (llama2). Under the hood, LangChain’s implementation instantiates all of the necessary scaffolding to connect to the API and issue requests to the local server.

Decent Performance 🥈

The performance of the application is acceptable. On an M2 MacBook Pro with 16 GB of RAM, generating embeddings and performing inference can take anywhere between 30 seconds and two minutes depending on the size of the content on the page and the complexity of the prompt. Further optimizations should be made to the core application. For example, a page’s content can be pre-processed into the vectorstore on page load instead of waiting for the user to submit a question or prompt.
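
One way to implement that optimization would be to keep a small cache of vector stores keyed by page URL, so embeddings are generated once per page instead of on every prompt. The sketch below reuses the abstractions already imported in background.ts; it is a hypothetical shape for the idea, not part of the current implementation.

// hypothetical optimization: cache one vector store per page URL so embeddings
// are generated once per page instead of on every prompt
const vectorStoreCache = new Map<string, MemoryVectorStore>();

const getVectorStore = async (url: string, content: string): Promise<MemoryVectorStore> => {
  const cached = vectorStoreCache.get(url);
  if (cached) return cached;

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 500,
    chunkOverlap: 0,
  });
  const documents = await splitter.createDocuments([content]);

  const vectorStore = await MemoryVectorStore.fromDocuments(
    documents,
    new OllamaEmbeddings({ baseUrl: OLLAMA_BASE_URL, model: OLLAMA_MODEL }),
  );
  vectorStoreCache.set(url, vectorStore);
  return vectorStore;
};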

A Failed Experiment? 🕳️

Local Server Deployment

We can’t expect everyone to know how to run a local server. This may not be obvious even for technical users. The requirement seems like a clear limiting factor for wider adoption of local LLMs in the browser. To be fair, Ollama already makes installation and setup dead simple. It’s just not easy enough for our parents.

Chrome Security is Too Strong

Before settling on Ollama, I used Web LLM as the local LLM provider for Lumos. Web LLM offers a JavaScript library for interfacing with various LLMs. After failing to work around Chrome’s security mechanisms that block remote code execution (i.e. WASM) in extensions built on Manifest V3, I conceded to the approach of running a local server.

I even attempted to run the Web LLM chat module from a sandboxed <iframe>, with no luck. Instead of running two local servers from two separate applications, I decided to migrate from Web LLM to Ollama.

Messy Content, Bad Data

The approach for retrieving content from the current tab is generalized to work for any webpage. Unfortunately, one size does not fit all. In many cases, the implementation extracted irrelevant content (e.g. ads, page navigation text, inline scripts) into the vectorstore. In the future, some form of arc90-style Readability parsing should be applied to extract only the relevant content from a webpage. Additionally, specialized scraping logic can be developed for specific websites.
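
For example, Mozilla’s @mozilla/readability package (a maintained descendant of the arc90 algorithm) could be used to keep only the article body. A rough sketch, assuming the package is bundled with the injected script; this is not part of the current implementation.

// sketch: extract only the readable article content before chunking
import { Readability } from "@mozilla/readability";

const extractReadableText = (): string => {
  // Readability mutates the DOM it is given, so parse a clone of the document
  const documentClone = document.cloneNode(true) as Document;
  const article = new Readability(documentClone).parse();
  return article?.textContent?.replace(/\s+/g, " ").trim() || "";
};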

Dynamic Content Indexing

Because every website contains different amounts of data and different types of data, it’s imperative to optimize the parameters for indexing the content into the vectorstore. chunkSize and chunkOverlap are statically set, but every website can probably have its own chunking values based on the size and type of content it contains. Sites with long-form content may need larger chunks and larger overlap, but sites with short and frequent posts may need smaller chunks and no overlap. Dynamically setting these parameters was out of scope for this project. Again, one size does not fit all.
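
A simple starting point might be a heuristic that picks chunking parameters from the amount of text on the page, as in the hypothetical sketch below (the thresholds and values are illustrative, not tuned).

// hypothetical heuristic: choose chunking parameters based on page length
const getChunkParams = (content: string) => {
  if (content.length > 20000) {
    // long-form content: larger chunks with some overlap to preserve context
    return { chunkSize: 1000, chunkOverlap: 100 };
  }
  // short, frequent posts: smaller chunks with no overlap
  return { chunkSize: 300, chunkOverlap: 0 };
};

const { chunkSize, chunkOverlap } = getChunkParams(context);
const splitter = new RecursiveCharacterTextSplitter({ chunkSize, chunkOverlap });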

LLM in the Browser

This post from Aditya pushes the idea of LLM in the browser to the next level. If a browser natively has LLM capability, a plethora of new use cases can be unlocked. The expanded set of use cases may push changes to the underlying protocols of the internet.

Should generative AI tags be added to the HTML standard?

<genai type="img" prompt="generate an image of Lumos" />

<genai type="text" prompt="rewrite the article in the preferred style of the reader">
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.
</genai>

With an LLM in the browser, maybe all sites come with free accessibility features that allow the browser to regenerate content in a manner that suits the user. Stock photos for Medium articles become dynamic, produced by generative AI on the fly and never stored on a platform’s servers. Maybe boring and dense news articles are rewritten in your favorite style of fiction (Harry Potter?).

Lumos demonstrates the ability to act on existing content on the internet, but the current implementation falls short of its full potential. Chrome is a wildly powerful platform to build on. It may be possible to build all of the aforementioned features in a Chrome extension. However, we should pause and examine whether we need a new Chrome altogether.

Nox.

References

  1. Lumos (GitHub)
  2. Local LLM in the Browser Powered by Ollama (Part 2)
  3. Let’s Normalize Online, In-Memory RAG! (Part 3)
  4. Supercharging If-Statements With Prompt Classification Using Ollama and LangChain (Part 4)
  5. Bolstering LangChain’s MemoryVectorStore With Keyword Search (Part 5)
  6. A Guide to Gotchas with LangChain Document Loaders in a Chrome Extension (Part 6)
