Exploring hybrid search and custom models in OpenSearch on TrueFoundry (Part I)

Martín Schonaker
12 min read · Jun 6, 2024


Introduction

In recent months, the OpenSearch community has been building a collection of Machine Learning features that are in high demand. Hybrid search is one of the most interesting among them.

Today, we're going to quickly develop a basic hybrid search application on TrueFoundry following one of the OpenSearch examples. Later in this series, we are going to learn how to host a custom Hugging Face model in seconds thanks to the TrueFoundry platform.

Part I: a first pass

We are going to iterate over our solution a couple of times. The idea is to replicate one of the OpenSearch documentation examples end-to-end as a first part and to experiment with custom models in a second part. "End-to-end" includes deploying a nice and simple Streamlit application to be the visible face of our system. All will be hosted together on TrueFoundry.

In the second part, we will host a custom model on CPU at a very low cost on TrueFoundry. The model will be ready to be fine-tuned, A/B tested, redeployed on GPU, scaled out, or even switched to another model from the model catalog.

Hosting a single node of OpenSearch

Once you have registered on your TrueFoundry account following the official documentation and have created your default workspace following the "Create and set up your account" guide, click on the "+New Deployment" button on the top right.

Deployments dashboard.

Choose to deploy a "Service" on the left panel, select your workspace and then choose "Code for Docker image" and then "Deploy from a public Docker image". Finally, click “Next".

Deploy from Docker wizard page.

Now, let's fill in the description and image parameters of our service. Choose "opensearch" as the name of the service instead of the default. Select "Docker image (Deploy an existing image)". For us, the image will be "opensearchproject/opensearch:2.14.0", the latest version at the moment; you can pick a different version from the DockerHub tags page.

Remove the "Command Override", as we are going to accept whatever command there is already configured in the Docker image, and change the HTTP port from 8000 to be 9200 instead. Uncheck the "Expose" checkbox as we won't expose the node publicly. We will use a Jupyter notebook to control the node. For the "Resources", let's leave everything on CPU and request 2000 MB of memory with a limit of 4000 MB. Additionally, let's also make the upper storage limit higher, 2000 MB should be fine.

In environment variables, we need to add two variables: one is "discovery.type=single-node" (keep a close eye on punctuation), so OpenSearch doesn't spend time listening for other nodes. The other one is "OPENSEARCH_INITIAL_ADMIN_PASSWORD" for which we will use a random password generated online. Ideally, we should use the secrets feature from TrueFoundry but, for the sake of time, we're going to skip it right now.
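
If you prefer not to use an online generator, here is a minimal sketch for producing a random password locally with Python's standard library (recent OpenSearch versions reject weak initial admin passwords, so the result should mix upper- and lowercase letters, digits and symbols):

# A minimal local alternative to an online password generator.
# Recent OpenSearch releases validate OPENSEARCH_INITIAL_ADMIN_PASSWORD,
# so we mix letters, digits and a few symbols.
import secrets
import string

alphabet = string.ascii_letters + string.digits + "!@#$%"
print("".join(secrets.choice(alphabet) for _ in range(20)))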

Create new service wizard page.

Hit "Submit". You should see the progress of the deployment on the TrueFoundry UI like this. Eventually the "ACTIVE VERSION" will turn green indicating that the deployment succeeded.

Deployment status in the deployments dashboard.

While this happens, let's start a Jupyter notebook kernel and create a script to check that the service is running as we expect.

Hit "+New Deployment" on the top right. Go to "Notebook/VSCode/SSH" on the left panel and select your workspace on the right. "Jupyter Notebook" should be selected by default, to let's hit "Next" to start the process. Enter a name for it. "jupyter" for instance. Select "JupyterLab Base Image" in the "Image" section. Provide an admin password for it and hit "Submit".

Go to the "Deployments" view and look for the "Notebook/SSH" tab. Find your service and press the "Endpoint" button. You should be redirected to a fresh Jupyter notebook environment. It may take a minute to start. So if you need to leave, you might want to click the play button on the last column of the panel to restart the kernel.

Notebook tab on the deployments dashboard.

Once the Jupyter kernel is ready, let's test whether the OpenSearch node is reachable from a notebook. Open the "Endpoint" to see the Jupyter application and create a new Jupyter notebook by double clicking on ".conda-jupyter-base".

Let's write our first cell to store the password in the notebook:

import getpass
opensearch_password = getpass.getpass("opensearch_password:")

After running the cell, a password textbox should open. Enter the same value that we generated for the environment variable "OPENSEARCH_INITIAL_ADMIN_PASSWORD".

For the next cell, we need the HTTPS endpoint of OpenSearch. Open the deployments dashboard of TrueFoundry and hover over the "endpoint" label on the OpenSearch service. Ensure that it starts with "https://" instead of "http://", and change it if required.

Endpoint label to hover over and copying the endpoint URL.

Write this as a cell:

# Disable HTTPS verification warnings (not for production).
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

###

import requests
import json

from requests.auth import HTTPBasicAuth

endpoint = "https://opensearch.hybrid-search.svc.cluster.local:9200"
auth = HTTPBasicAuth("admin", opensearch_password)

# A function that calls an HTTP request with an optional json body,
# prints the json output and returns it.
def request(method, uri, body=None):
    r = requests.request(method=method, url=f"{endpoint}{uri}", auth=auth, verify=False, json=body)
    r.raise_for_status()
    result = r.json()
    print(json.dumps(result, indent=2))
    return result

request("GET", "/");

After running the cell, we should see something like:

{
  "name": "opensearch-655469f55d-4g6p9",
  "cluster_name": "docker-cluster",
  "cluster_uuid": "foC8yyyMT0u697zIWyCyLw",
  "version": {
    "distribution": "opensearch",
    "number": "2.14.0",
    "build_type": "tar",
    "build_hash": "aaa555453f4713d652b52436874e11ba258d8f03",
    "build_date": "2024-05-09T18:51:00.973564994Z",
    "build_snapshot": false,
    "lucene_version": "9.10.0",
    "minimum_wire_compatibility_version": "7.10.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}

If you get an error, ensure that the node successfully started by looking at the logs in the deployments page. Search for a message similar to "Node opensearch-655469f55d-4g6p9 initialized".
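
As an extra sanity check (not part of the original example), the cluster health endpoint should also respond once the node is up:

# Optional sanity check: cluster health for the single node.
# Expect "green" here, or "yellow" later once indexes with
# unassigned replica shards have been created.
request("GET", "/_cluster/health");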

Registering a model and making some queries

So far so good. Now, let's fire off a few requests in quick succession to register the Hugging Face sentence-transformers model we are going to use for embeddings.

We will use a built-in model in this first part of the article and customize it later, in the second part. For reference, the model name is "huggingface/sentence-transformers/all-MiniLM-L6-v2".

Let's enable "ml_commons":

request("PUT", "/_cluster/settings", {
"persistent": {
"plugins": {
"ml_commons": {
"only_run_on_ml_node": "false",
"model_access_control_enabled": "true",
"native_memory_threshold": "99"
}
}
}
});
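
Optionally, we can read the settings back to confirm they were applied:

# Optional: read back the persistent cluster settings we just applied.
request("GET", "/_cluster/settings");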

Register a model group if it doesn't exist:

model_group_name = "my_model_group"

body = request("GET", "/_plugins/_ml/model_groups/_search", {
"query": {
"bool": {
"must": [
{
"terms": {
"name": [model_group_name]
}
}
]
}
}
})

model_group_id = None
model_group_docs = body.get('hits', {}).get('hits', {})
if model_group_docs:
model_group_id = model_group_docs[0]['_id']

if not model_group_id:
body = request("POST", "/_plugins/_ml/model_groups/_register", {
"name": model_group_name,
"description": f"A model group named {model_group_name}",
"access_mode": "public"
})
model_group_id = body['model_group_id']

model_group_id

Create the model in OpenSearch:

body = request("POST", "/_plugins/_ml/models/_register", {
"name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
"version": "1.0.1",
"model_group_id": model_group_id,
"model_format": "TORCH_SCRIPT"
})

task_id = body['task_id']

# Write this as a function, because we're going to use it a few times.
def await_for_task(task_id):
    import time

    print(f"Awaiting for the task {task_id} to finish", end='')

    while True:
        r = requests.get(f"{endpoint}/_plugins/_ml/tasks/{task_id}", auth=auth, verify=False)
        r.raise_for_status()

        body = r.json()
        state = body['state']
        if state == "COMPLETED":
            print("Done!")
            return body['model_id']

        print('.', end='')
        time.sleep(1)

model_id = await_for_task(task_id)
model_id

Wait for it to finish. Make a note of the model_id, as we are going to need it later.
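
If you lose the model_id (after restarting the kernel, for instance), it should be possible to recover it by searching the registered models by name; a quick sketch, assuming the model name used above:

# Optional: look the model up again by name.
# The _id of the matching model document is the model_id.
body = request("GET", "/_plugins/_ml/models/_search", {
    "query": {
        "match": {
            "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2"
        }
    }
})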

Deploy the model:

body = request("POST", f"/_plugins/_ml/models/{model_id}/_deploy")
task_id = body['task_id']

model_id = await_for_task(task_id)
model_id

Again, wait for it. Now try the model with a random sentence:

request("POST", f"/_plugins/_ml/_predict/text_embedding/{model_id}", {
"text_docs":[ "today is sunny"],
"return_number": True,
"target_response": ["sentence_embedding"]
});

The answer should look something like:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "sentence_embedding",
          "data_type": "FLOAT32",
          "shape": [
            384
          ],
          "data": [
            -0.023314998,
            0.08975688,
            0.07847973,
            0.0610709,
            // ... skipped values ...
            0.045115355
          ]
        }
      ]
    }
  ]
}

Creating an ingest pipeline and indexing

Create an ingest pipeline:

ingest_pipeline = "my-ingest-pipeline"

request("PUT", f"/_ingest/pipeline/{ingest_pipeline}", {
"processors": [
{
"text_embedding": {
"model_id": model_id,
"field_map": {
"text": "passage_embedding"
}
}
}
]
});
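
Before indexing anything, the pipeline can optionally be tried out with the _simulate endpoint to confirm that an embedding is produced for a sample document:

# Optional: simulate the ingest pipeline on a sample document and check
# that the output contains a "passage_embedding" vector of 384 floats.
request("POST", f"/_ingest/pipeline/{ingest_pipeline}/_simulate", {
    "docs": [
        {
            "_source": {
                "text": "today is sunny"
            }
        }
    ]
});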

Create the index:

index_name = "my-index-name"

# Delete the index if it already exists, so the notebook can be replayed.
r = requests.get(f"{endpoint}/{index_name}", auth=auth, verify=False)
if r.status_code != 404:
    r = requests.delete(f"{endpoint}/{index_name}", auth=auth, verify=False)
    r.raise_for_status()

request("PUT", f"/{index_name}", {
"settings": {
"index.knn": True,
"default_pipeline": ingest_pipeline
},
"mappings": {
"properties": {
"id": {
"type": "text"
},
"passage_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"engine": "lucene",
"space_type": "l2",
"name": "hnsw",
"parameters": {}
}
},
"text": {
"type": "text"
}
}
}
});

Index a few docs:

docs = [{
    "text": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
    "id": "4319130149.jpg"
}, {
    "text": "A wild animal races across an uncut field with a minimal amount of trees .",
    "id": "1775029934.jpg"
}, {
    "text": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
    "id": "2664027527.jpg"
}, {
    "text": "A man who is riding a wild horse in the rodeo is very near to falling off .",
    "id": "4427058951.jpg"
}, {
    "text": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
    "id": "2691147709.jpg"
}]

for doc in docs:
    request("PUT", f"/{index_name}/_doc/{doc['id']}", doc)

Ensure that regular keyword search is working:

request("GET", f"/{index_name}/_search", {
"_source": {
"excludes": [
"passage_embedding"
]
},
"query": {
"match": {
"text": {
"query": "wild west"
}
}
}
});

Ensure neural search is also working:

request("GET", f"/{index_name}/_search", {
"_source": {
"excludes": [
"passage_embedding"
]
},
"query": {
"neural": {
"passage_embedding": {
"query_text": "wild west",
"model_id": model_id,
"k": 5
}
}
}
});

Check that hybrid search is working:

request("GET", f"/{index_name}/_search", {
"_source": {
"exclude": [
"passage_embedding"
]
},
"query": {
"hybrid": {
"queries": [
{
"match": {
"passage_text": {
"query": "wild west"
}
}
},
{
"neural": {
"passage_embedding": {
"query_text": "wild west",
"model_id": model_id,
"k": 5
}
}
}
]
}
},
"search_pipeline": {
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [
0.3,
0.7
]
}
}
}
}
]
}
});
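
Here the normalization and combination settings travel inline with every request. As an optional alternative (not used in the rest of this article), the same processor can be stored once as a named search pipeline and referenced with the search_pipeline query parameter; a sketch with the made-up name "my-search-pipeline":

# Optional alternative: store the processor as a named search pipeline
# instead of sending it inline with every query.
request("PUT", "/_search/pipeline/my-search-pipeline", {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.3, 0.7]}
                }
            }
        }
    ]
});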

Creating a search template and getting OpenSearch ready

Let's now create a search template with the above query, so we can quickly reuse it:

search_template = "my-search-template"

request("POST", f"/_scripts/{search_template}", {
    "script": {
        "lang": "mustache",
        "source": {
            "from": "{{from}}{{^from}}0{{/from}}",
            "size": "{{size}}{{^size}}10{{/size}}",
            "_source": {
                "excludes": [
                    "passage_embedding"
                ]
            },
            "query": {
                "hybrid": {
                    "queries": [
                        {
                            "match": {
                                "text": {
                                    "query": "{{query}}"
                                }
                            }
                        },
                        {
                            "neural": {
                                "passage_embedding": {
                                    "query_text": "{{query}}",
                                    "model_id": "{{model_id}}",
                                    "k": 5
                                }
                            }
                        }
                    ]
                }
            },
            "search_pipeline": {
                "phase_results_processors": [
                    {
                        "normalization-processor": {
                            "normalization": {
                                "technique": "min_max"
                            },
                            "combination": {
                                "technique": "arithmetic_mean",
                                "parameters": {
                                    "weights": [
                                        0.3,
                                        0.7
                                    ]
                                }
                            }
                        }
                    }
                ]
            }
        },
        "params": {
            "query": ""
        }
    }
});

If everything went okay, we should be able to search using the template:

request("GET", f"/{index_name}/_search/template", {
"id": search_template,
"params": {
"query": "basket",
"model_id": model_id,
"size": 2
}
});

The response from OpenSearch should look something like this:

{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.7,
    "hits": [
      {
        "_index": "my-index-name",
        "_id": "4319130149.jpg",
        "_score": 0.7,
        "_source": {
          "text": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
          "id": "4319130149.jpg"
        }
      },
      {
        "_index": "my-index-name",
        "_id": "2691147709.jpg",
        "_score": 0.00070000003,
        "_source": {
          "text": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
          "id": "2691147709.jpg"
        }
      }
    ]
  }
}

Nice! It's been a long trip to get everything down to a few lines of JSON.

Putting a user interface together

Now let's create a private GitHub repository with a very simple Streamlit application. Add these three files to the repository: Dockerfile (for building the Docker image), requirements.txt (for Python packages) and streamlit_app.py (the Streamlit application itself).

First, the Dockerfile:

FROM python:3.9-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    software-properties-common \
    && rm -rf /var/lib/apt/lists/*

ADD requirements.txt streamlit_app.py ./

RUN pip3 install -r requirements.txt

EXPOSE 80

HEALTHCHECK CMD curl --fail http://localhost/_stcore/health

ENTRYPOINT ["streamlit", "run", "streamlit_app.py", "--server.port=80", "--server.address=0.0.0.0"]

Then requirements.txt (yes, only these two packages):

streamlit==1.31.1
requests==2.32.1

Finally, streamlit_app.py:

import os
import requests

from requests.auth import HTTPBasicAuth

opensearch_password = os.getenv("OPENSEARCH_PASSWORD")
assert opensearch_password, "OPENSEARCH_PASSWORD environment variable is required"
opensearch_model_id = os.getenv("OPENSEARCH_MODEL_ID")
assert opensearch_model_id, "OPENSEARCH_MODEL_ID environment variable is required"

opensearch_endpoint = os.getenv(
    "OPENSEARCH_ENDPOINT", "https://opensearch.hybrid-search.svc.cluster.local:9200"
)
opensearch_index = os.getenv("OPENSEARCH_INDEX", "my-index-name")
opensearch_template = os.getenv("OPENSEARCH_TEMPLATE", "my-search-template")


# A function that sends an HTTP request with an optional JSON body
# and returns the parsed JSON response.
def request(method, uri, body=None):
    r = requests.request(
        method=method,
        url=f"{opensearch_endpoint}{uri}",
        auth=HTTPBasicAuth("admin", opensearch_password),
        verify=False,
        json=body,
    )
    r.raise_for_status()
    result = r.json()
    return result

### Streamlit code starts here ###

import streamlit as st

st.title("OpenSearch Hybrid Search")

if "my_query" not in st.session_state:
st.session_state.my_query = ""

def submit():
st.session_state.my_query = st.session_state.widget
st.session_state.widget = ""

st.text_input("Enter your query", key="widget", on_change=submit)

my_query = st.session_state.my_query

if my_query:
    result = request(
        "GET",
        f"/{opensearch_index}/_search/template",
        {
            "id": opensearch_template,
            "params": {"query": my_query, "model_id": opensearch_model_id, "size": 2},
        },
    )
    st.text(f"Results for \"{my_query}\":")
    st.table([hit['_source'] for hit in result['hits']['hits']])

This is the basic example from the Streamlit documentation, but using port 80 instead of the default 8501.

Commit and push to your GitHub repo and now open TrueFoundry. Go to Deployments dashboard, smash "+New Deployment" and select your workspace, then "Code from Git repo" and "Deploy from your Git repo" (so we can use a private repo).

Click "Next", choose a name. "streamlit" should be fine. Select "Source Code (Build and deploy source code)".

Click on "Integrate Git". A new tab will be spawned in your browser. Choose "Github" and click on "Link Github". A new screen will be spawned. When asked, choose your account and grant access only to your repository. It's cool to be able to select access to a single repo.

Close the tab and return to the wizard. "Repo URL" should now allow you to choose your repository. Everything else should be detected and filled in automatically: the Dockerfile path, the build context path, and the default command.

In the environment variables section, let's add both OPENSEARCH_PASSWORD and OPENSEARCH_MODEL_ID. The OpenSearch password is the one we generated at the beginning, when deploying OpenSearch, and the model id is the key that OpenSearch assigned to the model. Grab it from the Jupyter notebook.

Let's change the port to 80 and expose it as we did before. Choose whatever DNS name you like for the service. Hit "Submit". Once the deployment status is green, the example Streamlit application should be visible when you click the "Endpoint" button.

Did it work? That's great!

Streamlit application working.

Conclusion

So far, we have created a very basic application on a single OpenSearch node, with a very simple Streamlit UI, and controlled the whole process from a Jupyter notebook. It is a very basic deployment, but it could be the seed of something bigger.

There are plenty of directions to take this example towards a production application: deploying an OpenSearch Kubernetes operator to run a managed OpenSearch cluster, tuning the autoscaling of the Streamlit app, indexing more and different data, or even developing dedicated endpoints for indexing.

However, we will focus on a particular step towards creating a production application: deploying a custom model for the embeddings.

For some organizations, machine learning is more difficult (or even forbidden) because no data may leave the organization's network, so training a model on an external service is not possible. Having OpenSearch with custom models, fine-tuned for the organization's domain and hosted on premises, could be a game changer.

See you in the next part.
