Customizing the Embedding Model Connection for a Local RAG App with Ollama and Kernel Memory

Ian Chen

Published in playtech · 11 min read · May 28, 2024

A few days ago another reader asked how to connect to an embedding model running locally. Solving this takes a little extra work, and this post explains how.

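There are many ways to deploy models locally, so here is the environment used in this post:

Model deployment: Ollama, running in a container
Vector DB: Qdrant, also running in a container
RAG app: a .NET console app, used for the demo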

For setting up Ollama and the vector DB, see the previous two posts. In this example the embedding model is snowflake-arctic-embed, and the generative model is Microsoft's phi3.

To download the models, make sure Ollama is running and call its API directly with curl:

curl http://localhost:11434/api/pull -d '{"name": "snowflake-arctic-embed"}'
curl http://localhost:11434/api/pull -d '{"name": "phi3"}'

To see which models have already been downloaded locally, call the API for the model list:

curl http://localhost:11434/api/tags
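The same check can be done from the .NET side before the RAG app starts. The following is a minimal sketch, assuming the /api/tags response is a JSON object with a "models" array whose entries carry a "name" field such as "phi3:latest". The class and method names (OllamaTags, ParseModelNames, IsPulled) are hypothetical helpers, not part of Ollama or Kernel Memory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical helper (not from the original article) for reading /api/tags output.
public static class OllamaTags
{
    // Collects the "name" of every entry in the "models" array.
    // ASSUMPTION: the response root is a JSON object shaped like { "models": [ { "name": "..." } ] }.
    public static List<string> ParseModelNames(string json)
    {
        var names = new List<string>();
        using var doc = JsonDocument.Parse(json);

        if (doc.RootElement.TryGetProperty("models", out var models) &&
            models.ValueKind == JsonValueKind.Array)
        {
            foreach (var model in models.EnumerateArray())
            {
                if (model.TryGetProperty("name", out var name) &&
                    name.ValueKind == JsonValueKind.String)
                {
                    names.Add(name.GetString()!);
                }
            }
        }

        return names;
    }

    // True when a listed model matches the requested name, with or without a ":tag" suffix.
    public static bool IsPulled(IEnumerable<string> names, string model) =>
        names.Any(n => n == model || n.StartsWith(model + ":", StringComparison.Ordinal));
}
```

In the console app, the JSON could be fetched with `await new HttpClient().GetStringAsync("http://localhost:11434/api/tags")` and passed to ParseModelNames, then checked with IsPulled for both snowflake-arctic-embed and phi3.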

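(A quick plug)

If you are not sure what RAG or a vector database is, the ChatGPT development book I co-wrote with two friends has dedicated chapters explaining both in detail. Book: 極速 ChatGPT 開發者兵器指南：跨界整合 Prompt Flow、LangChain 與 Semantic Kernel 框架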

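The task this time is to deploy an embedding model locally and build a RAG app on top of it. If you read the previous post, you might assume this works just like deploying a generative model, but there is a catch. At the time of writing, the Ollama API is compatible with the OpenAI Chat Completions API but not with the OpenAI Embeddings API (the maintainers say compatibility is planned). As a result, Kernel Memory's built-in OpenAI connector cannot be used to connect to the embedding model. Instead, the connection has to be customized with a custom embedding generator (WithCustomEmbeddingGenerator): you call the Ollama Embedding API yourself and plug that call into Kernel Memory to handle vectorization.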

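Next, a note on Chinese-language applications. Most embedding models available today support Traditional Chinese poorly, and those that do support Chinese focus mainly on Simplified Chinese. Finding a good Traditional Chinese embedding model to deploy locally is therefore not easy, and a real application may require fine-tuning one yourself. This post uses an English model, since its purpose is to show how to connect a custom embedding model in Kernel Memory.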

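First, besides its built-in connectors, Kernel Memory lets you implement the ITextEmbeddingGenerator interface yourself. In fact, the built-in connectors are concrete implementations of that same interface, which shows how much abstraction and flexibility Kernel Memory offers as an SDK.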

1. Create the OllamaEmbeddingGeneratorConfig class to hold the parameters for connecting to the Ollama Embedding API. The right MaxToken value depends on the embedding model you use; Kernel Memory's internal default is 1000.

public class OllamaEmbeddingGeneratorConfig
{
    public int MaxToken { get; set; } = 4096;
    public string Endpoint { get; set; }
    public string EmbeddingModel { get; set; }
}

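2. Create the OllamaEmbeddingGenerator class and implement the ITextEmbeddingGenerator interface. The GenerateEmbeddingAsync method is where the Ollama Embedding API is actually called. According to the API spec, it takes two parameters: model and prompt. model is the name of the embedding model, and prompt is the text to be vectorized (in a RAG app, this is the knowledge base content).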

public async Task<Embedding> GenerateEmbeddingAsync(
        string text, CancellationToken cancellationToken = default)
{
    // Use the lowercase key names from the Ollama API spec: "model" and "prompt".
    var requestBody = new { model = _config.EmbeddingModel, prompt = text };
    var content = new StringContent(JsonSerializer.Serialize(requestBody), Encoding.UTF8, "application/json");
    var response = await _httpClient.PostAsync(_config.Endpoint, content, cancellationToken);

    if (!response.IsSuccessStatusCode)
    {
        throw new KernelMemoryException("Failed to generate embedding for the given text");
    }

    // The response body has the shape { "embedding": [ ... ] }.
    var responseContent = await response.Content.ReadAsStringAsync(cancellationToken);
    var responseJson = JsonSerializer.Deserialize<Dictionary<string, List<float>>>(responseContent);
    var embeddingArray = responseJson["embedding"].ToArray();
    return new Embedding(embeddingArray);
}
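With an anonymous type, the C# property names become the JSON keys exactly as written. One way to keep the keys aligned with the spec regardless of C# naming conventions is a small typed request. This is only a sketch: OllamaEmbeddingRequest and OllamaEmbeddingRequestJson are hypothetical names introduced here for illustration, not types from Ollama or Kernel Memory.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical DTO (not from the original article): pins the JSON key names
// to the "model" and "prompt" parameters described in the API spec.
public sealed record OllamaEmbeddingRequest(
    [property: JsonPropertyName("model")] string Model,
    [property: JsonPropertyName("prompt")] string Prompt);

public static class OllamaEmbeddingRequestJson
{
    // Builds the JSON request body for the embedding endpoint.
    public static string Serialize(string model, string text) =>
        JsonSerializer.Serialize(new OllamaEmbeddingRequest(model, text));
}
```

Inside GenerateEmbeddingAsync, the string returned by OllamaEmbeddingRequestJson.Serialize(_config.EmbeddingModel, text) could be passed to StringContent in place of the serialized anonymous object.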

The Ollama Embedding API endpoint is http://localhost:11434/api/embeddings. The response is a JSON object whose embedding property holds the vector, which is the data to be stored in the vector DB.

{
  "embedding": [
    0.5670403838157654, 0.009260174818336964, 0.23178744316101074, -0.2916173040866852, -0.8924556970596313,
    0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281
  ]
}
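Deserializing the body into a Dictionary<string, List<float>> works for the shape above, but it fails with a less helpful exception if the body has any other shape. A slightly more defensive way to read the vector might look like the sketch below. OllamaEmbeddingResponse and ParseEmbedding are hypothetical names, and the only assumption about the payload is the "embedding" array shown in the sample.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper (not from the original article) for reading the embedding response.
public static class OllamaEmbeddingResponse
{
    // Returns the "embedding" array as float[], or throws a descriptive error
    // when the body does not contain one.
    public static float[] ParseEmbedding(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;

        if (root.ValueKind != JsonValueKind.Object ||
            !root.TryGetProperty("embedding", out var embedding) ||
            embedding.ValueKind != JsonValueKind.Array)
        {
            throw new InvalidOperationException("Response has no \"embedding\" array.");
        }

        // Copy each JSON number into the vector in order.
        var vector = new float[embedding.GetArrayLength()];
        var i = 0;
        foreach (var value in embedding.EnumerateArray())
        {
            vector[i++] = value.GetSingle();
        }

        return vector;
    }
}
```

The resulting array can be passed straight to new Embedding(...) in GenerateEmbeddingAsync, and the InvalidOperationException could be wrapped in a KernelMemoryException there if a single exception type is preferred.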

3. Next, build the RAG app demo. As noted earlier, the embedding model is snowflake-arctic-embed and the generative model is Microsoft's phi3.

// Embedding model config
var ollamaEmbeddingConfig = new OllamaEmbeddingGeneratorConfig
{
    Endpoint = "http://localhost:11434/api/embeddings", // Ollama Embedding API
    EmbeddingModel = "snowflake-arctic-embed"
};

// Chat model config (Ollama's OpenAI-compatible endpoint)
var ollamaGenerationConfig = new OpenAIConfig
{
    Endpoint = "http://localhost:11434/v1",
    TextModel = "phi3",
    APIKey = "0" // placeholder value
};

// Wire up text generation, the custom embedding generator, and Qdrant
var kernelMemory = new KernelMemoryBuilder()
    .WithOpenAITextGeneration(ollamaGenerationConfig)
    .WithCustomEmbeddingGenerator(new OllamaEmbeddingGenerator(ollamaEmbeddingConfig))
    .WithQdrantMemoryDb("http://localhost:6333")
    .Build<MemoryServerless>();

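4. Vectorize the knowledge base. A short passage of text is used here for simplicity; in practice you can also import PDFs or web pages, both of which Kernel Memory handles out of the box.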

static async Task ImportKm(MemoryServerless memory)
{
    await memory.ImportTextAsync(@"By Susan M. Niebur with David W. Brown, Editor
When it started in the early 1990s, NASA’s Discovery Program represented a breakthrough in the way NASA explores space. Providing opportunities for low-cost planetary science missions, the Discovery Program has funded a series of relatively small, focused, and innovative missions to investigate the planets and small bodies of our solar system.
For over 30 years, Discovery has given scientists a chance to dig deep into their imaginations and find inventive ways to unlock the mysteries of our solar system and beyond. As a complement to NASA’s larger “flagship” planetary science explorations, Discovery’s continuing goal is to achieve outstanding results by launching more, smaller missions using fewer resources and shorter development times.
This book draws on interviews with program managers, engineers, and scientists from Discovery’s early missions. It takes an in-depth look at the management techniques they used to design creative and cost-effective spacecraft that continue to yield ground-breaking scientific data, drive new technology innovations, and achieve what has never been done before.
", documentId: "nasa-ebook");
}

5. The result, for the question: What is NASA’s plan for Discovery?
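For reference, asking that question from the console app could look roughly like the sketch below. It assumes the memory instance built in step 3 and the text imported in step 4, and it mirrors the style of ImportKm; the method name AskKm is hypothetical. The answer depends on the models and services running locally, so no output is shown here.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.KernelMemory;

// Hypothetical helper (not from the original article), written as a static local
// function like ImportKm: asks one question against the imported knowledge base.
static async Task AskKm(MemoryServerless memory)
{
    // Retrieval runs through the custom Ollama embedding generator and Qdrant;
    // the final answer is generated by phi3.
    var answer = await memory.AskAsync("What is NASA's plan for Discovery?");
    Console.WriteLine(answer.Result);

    // List the sources the answer was grounded on.
    foreach (var source in answer.RelevantSources)
    {
        Console.WriteLine($"  source: {source.SourceName}");
    }
}
```

Calling await ImportKm(kernelMemory) followed by await AskKm(kernelMemory) would exercise the whole pipeline end to end.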

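Overall, once model deployment is sorted out, building a fully local RAG app with Kernel Memory's help is not hard (how well RAG actually performs is a separate topic). This is something I have felt strongly since I started following SDKs like Semantic Kernel and LangChain last year: these tools greatly reduce development and integration time. Without them, just dealing with different models' vectors, differing API parameters, and prompt handling would eat up a lot of time. Of course, some fundamentals still need to be learned, such as what vectors are, prompting techniques, and what an embedding is.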

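The complete sample code is published on GitHub for anyone who wants to refer to it. Feel free to support our new book as well, and give a little encouragement to authors who are still willing to publish technical books in Traditional Chinese :)

GitHub: complete sample code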
