Semantic Search by Amazon OpenSearch Serverless & Amazon Bedrock

Bharathvajan G
4 min read · Apr 9, 2024


Introduction

In this article, we will cover how to leverage Amazon OpenSearch Serverless to perform semantic search, which differs from typical full-text search.

Amazon OpenSearch

  • It is derived from a mature, open-source version of Elasticsearch (forked at version 7.10.2)
  • It is a highly scalable, fully managed search engine and analytics service

Amazon OpenSearch Serverless

It is a serverless deployment option for OpenSearch that eliminates the need to provision and manage infrastructure.

Full Text Search

  • It is a search technique that looks for matches of a search query within the entire text of a document or a set of documents.
  • It searches for all occurrences of the words or phrases that are specified in the search query, regardless of their location or context within the document.
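For contrast with the semantic search shown later, a plain full-text query in OpenSearch's Query DSL can be sketched as below. The index name and field name are illustrative and simply mirror the example index built later in this article.

```python
# Illustrative full-text "match" query body in OpenSearch Query DSL.
# "doc_index1" and "doc_text" mirror the index used later in this article.
full_text_query = {
    "size": 10,
    "query": {
        "match": {
            "doc_text": {
                "query": "image model",
                "operator": "or"   # match documents containing any of the terms
            }
        }
    }
}

# With an opensearch-py client this would run as:
# response = opensearch_client.search(index="doc_index1", body=full_text_query)
```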

Semantic Search

  • It is a more advanced search technique that understands the meaning of the search query and the context in which it is used.
  • It uses natural language processing (NLP) and machine learning to understand the intent behind the search query and provide more relevant results.

Vector Embeddings

  • A way to represent complex data, such as words, sentences, or even images, as points in a vector space, using vectors of real numbers
  • Vector embeddings convert words, sentences, and other data (audio, images, etc.) into numbers that capture their meaning and relationships

Semantic Search & Vector Embedding

  • Semantic search is one of the most popular uses of vector embeddings. Search algorithms such as k-NN (k-nearest neighbors) and ANN (approximate nearest neighbor) determine similarity by calculating the distance between vectors.
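As a minimal illustration of distance-based similarity, the sketch below computes cosine similarity between toy 3-dimensional "embeddings" (real embedding models produce hundreds or thousands of dimensions; the vectors and labels here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings": the first two point in similar directions.
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.8, 0.2, 0.0]
v_car = [0.0, 0.1, 0.9]

print(cosine_similarity(v_cat, v_kitten))  # close to 1.0 (similar meaning)
print(cosine_similarity(v_cat, v_car))     # close to 0.0 (unrelated)
```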

Types of Vector Embedding

  • Word / Text Embedding
  • Document Embedding
  • Image Embedding
  • Audio Embedding

Create a Collection

A collection is very similar to a database in the RDBMS world. Here are the steps to create a collection.

  • Go to AWS Management Console
  • From Services, select Amazon OpenSearch Service
  • Choose Serverless, then Create collection, and select Vector search as the collection type

Create Vector Index with JSON configuration

An index is similar to a table in a database. Here are the steps to create a vector index in a collection.

Refer to the following JSON configuration, which contains the “mappings” and “settings” details used to create the index.

Note: This index (table) has only two properties (fields or columns):

  • doc_text : Data type is text
  • doc_vector: Data type is knn_vector with dimension 1536, matching the output size of the Amazon Titan text-embeddings model used below

{
  "mappings": {
    "properties": {
      "doc_text": {"type": "text"},
      "doc_vector": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "engine": "nmslib",
          "space_type": "cosinesimil",
          "name": "hnsw",
          "parameters": {"ef_construction": 512, "m": 16}
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "knn.algo_param": {"ef_search": 512},
      "knn": true
    }
  }
}

Ensure the index is created within the collection.
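Alternatively, the index can be created programmatically with the opensearch-py client. The sketch below mirrors the JSON configuration above; the client setup is assumed, and the dimension must equal your embedding model's output size (1536 for amazon.titan-embed-text-v1).

```python
# Dimension must equal the embedding model's output size
# (amazon.titan-embed-text-v1 returns 1536-dimensional vectors).
EMBEDDING_DIM = 1536

index_body = {
    "mappings": {
        "properties": {
            "doc_text": {"type": "text"},
            "doc_vector": {
                "type": "knn_vector",
                "dimension": EMBEDDING_DIM,
                "method": {
                    "engine": "nmslib",
                    "space_type": "cosinesimil",
                    "name": "hnsw",
                    "parameters": {"ef_construction": 512, "m": 16},
                },
            },
        }
    },
    "settings": {
        "index": {
            "number_of_shards": 1,
            "knn.algo_param": {"ef_search": 512},
            "knn": True,
        }
    },
}

# With an authenticated opensearch-py client (configured as in the
# Lambda section below), the index would be created like this:
# opensearch_client.indices.create(index="doc_index1", body=index_body)
```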

Create Lambda function to ingest words & perform semantic search

  • Go to AWS Management Console
  • Choose Lambda service
  • Create Function & enter the function name
  • Select Runtime as the latest Python (3.12) [Note: this runtime includes a boto3 version with the Bedrock service.]
  • Ensure the Lambda function’s IAM role has the required policy permissions for Bedrock and the OpenSearch Service.

Code snippet: Create vector embeddings for text/words leveraging the Amazon Titan model.

import json
import boto3
import logging
from opensearchpy import OpenSearch, RequestsHttpConnection

logger = logging.getLogger()
logger.setLevel(logging.INFO)

bedrock = boto3.client(service_name='bedrock-runtime')

def lambda_handler(event, context):

    # Convert words to vectors
    text_obj1 = "Titan Model : Text & Image generation, Summarization,"
    text_obj2 = "Stable Diffusion: Generate high quality image"
    text_obj3 = "Claude: Content Creation and Complex Reasoning"

    vector_obj1 = word_embedding(text_obj1)
    vector_obj2 = word_embedding(text_obj2)
    vector_obj3 = word_embedding(text_obj3)

    # Ingest vector embeddings into OpenSearch
    ingest_document(text_obj1, vector_obj1)
    ingest_document(text_obj2, vector_obj2)
    ingest_document(text_obj3, vector_obj3)

def word_embedding(text):
    body = json.dumps({"inputText": text})
    response = bedrock.invoke_model(body=body,
                                    modelId='amazon.titan-embed-text-v1',
                                    accept='application/json',
                                    contentType='application/json')
    response_body = json.loads(response.get('body').read())
    return response_body.get('embedding')

Code Snippet: Ingest Vector Embedding to OpenSearch

from opensearchpy import AWSV4SignerAuth

# 'aoss' is the service name for OpenSearch Serverless; the region is a placeholder
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, 'us-east-1', 'aoss')

opensearch_client = OpenSearch(
    hosts=[{"host": "opensearch_endpoint_placeholder", "port": 443}],
    http_auth=auth, use_ssl=True, verify_certs=True,
    connection_class=RequestsHttpConnection,
    pool_maxsize=10
)

def ingest_document(text_obj, vector_obj):
    document = {
        "doc_text": text_obj,
        "doc_vector": vector_obj
    }

    response = opensearch_client.index(
        index='doc_index1',
        body=document
    )

Code snippet: Perform semantic search using OpenSearch Query DSL

Let’s say we want to perform a semantic search for “Image Model” against the index. First we need to convert the search query text into a vector embedding, and then call the following method.

def perform_vector_search(vector):
    document = {
        "size": 15,
        "_source": {"excludes": ["doc_vector"]},
        "query": {
            "knn": {
                "doc_vector": {
                    "vector": vector,
                    "k": 10
                }
            }
        }
    }
    response = opensearch_client.search(
        body=document,
        index="doc_index1"
    )
    return response
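The search response is a nested dictionary. Below is a minimal sketch of pulling out the matched texts and scores; the structure follows OpenSearch's standard response shape, with sample data standing in for a real response.

```python
# Sample response shaped like an OpenSearch k-NN search result.
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.91,
             "_source": {"doc_text": "Stable Diffusion: Generate high quality image"}},
            {"_score": 0.57,
             "_source": {"doc_text": "Claude: Content Creation and Complex Reasoning"}},
        ]
    }
}

def extract_hits(response):
    """Return (score, text) pairs from a search response."""
    return [
        (hit["_score"], hit["_source"]["doc_text"])
        for hit in response["hits"]["hits"]
    ]

for score, text in extract_hits(sample_response):
    print(f"{score:.2f}  {text}")
```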

Conclusion

To conclude, we have seen how OpenSearch provides efficient vector-similarity search through its specialized k-NN index.

In the next article, we will cover use cases for OpenSearch Service’s vector database with Retrieval Augmented Generation (RAG) with LLMs, recommendation engines, and rich-media search.
