How to Use Elasticsearch as a Vector Database

Published in Data Engineering

2 min read · Mar 19, 2024

Let's set up a single-node Elasticsearch cluster on a local machine.

Pull the Docker images.

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker pull docker.elastic.co/kibana/kibana:8.12.2

Start the Elasticsearch and Kibana containers.

docker network create elastic

docker run -d --name elasticsearch --net elastic -p 9200:9200 -p 9300:9300 -m 1GB -e "discovery.type=single-node" -e "ELASTIC_PASSWORD=passw0rd" docker.elastic.co/elasticsearch/elasticsearch:8.12.2

docker run -d --name kibana --net elastic -p 5601:5601 docker.elastic.co/kibana/kibana:8.12.2

Verify that the containers are up and running.

docker ps

CONTAINER ID   IMAGE                                                  COMMAND                  CREATED             STATUS             PORTS                                            NAMES
e288f61740da   docker.elastic.co/kibana/kibana:8.12.2                 "/bin/tini -- /usr/l…"   About an hour ago   Up About an hour   0.0.0.0:5601->5601/tcp                           kibana
16e62f66f4e0   docker.elastic.co/elasticsearch/elasticsearch:8.12.2   "/bin/tini -- /usr/l…"   About an hour ago   Up About an hour   0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp   elasticsearch
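`docker ps` only shows that the containers are running, and Elasticsearch can take a little while before it answers requests. The following is a minimal sketch of a readiness check using only the Python standard library. The helper names `build_health_request` and `wait_for_cluster` are made up for this illustration, and certificate verification is turned off on the assumption that this is a throwaway local cluster with a self-signed certificate.

```python
import base64
import json
import ssl
import time
import urllib.request


def build_health_request(base_url, user, password):
    """Build a GET request for the cluster health endpoint with basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(f"{base_url}/_cluster/health")
    request.add_header("Authorization", f"Basic {token}")
    return request


def wait_for_cluster(base_url, user, password, attempts=30, delay=2.0):
    """Poll the health endpoint until it answers; return the reported status."""
    # Assumption: local test cluster with a self-signed certificate, so
    # certificate verification is disabled for this check only.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    for _ in range(attempts):
        try:
            request = build_health_request(base_url, user, password)
            with urllib.request.urlopen(request, context=context, timeout=5) as resp:
                return json.load(resp)["status"]
        except OSError:
            time.sleep(delay)  # not reachable yet, try again
    raise TimeoutError("Elasticsearch did not become ready in time")


# Example call against the container started above:
# wait_for_cluster("https://localhost:9200", "elastic", "passw0rd")
```

On a single-node cluster the returned status would typically be `green` or `yellow`.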

Now, let's create the `movies` index. We will use the text-embedding-3-small model to generate a vector embedding for the title field and store it as title_embedding. This model generates embeddings of length 1536, so the title_embedding field must be mapped as a dense_vector with 1536 dimensions.

PUT /movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "genre": {
        "type": "keyword"
      },
      "release_year": {
        "type": "integer"
      },
      "title_embedding": {
        "type": "dense_vector",
        "dims": 1536
      }
    }
  }
}
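The same index could also be created from Python instead of the Kibana console. This is a sketch, assuming `es_client` is an `Elasticsearch` client like the one constructed later in this article. The `check_embedding` helper is a hypothetical addition that catches vectors of the wrong length before Elasticsearch rejects them.

```python
EMBEDDING_DIMS = 1536  # output length of text-embedding-3-small

MOVIES_MAPPINGS = {
    "properties": {
        "title": {"type": "text"},
        "genre": {"type": "keyword"},
        "release_year": {"type": "integer"},
        "title_embedding": {"type": "dense_vector", "dims": EMBEDDING_DIMS},
    }
}


def create_movies_index(es_client, index_name="movies"):
    """Create the index with the mapping above through the Python client."""
    return es_client.indices.create(index=index_name, mappings=MOVIES_MAPPINGS)


def check_embedding(vector, dims=EMBEDDING_DIMS):
    """Raise early if a vector would not fit the dense_vector mapping."""
    if len(vector) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
    return vector
```

Keeping the dimension in one constant means the mapping and the length check stay in step if the embedding model changes.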

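Let's insert a few documents using the Elasticsearch Python client.

Elasticsearch 8 serves HTTPS with a self-signed certificate by default, so the Python client needs a way to trust it. One option is to pass the certificate's SHA-256 fingerprint as `ssl_assert_fingerprint`. We can get the fingerprint with the following command: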

openssl s_client -connect localhost:9200 -servername localhost -showcerts </dev/null 2>/dev/null \
  | openssl x509 -fingerprint -sha256 -noout -in /dev/stdin

sha256 Fingerprint=AA:BB:CC:3C:A4:99:12:A8:D6:41:B7:A6:52:ED:CA:2E:0E:64:E2:0E:A7:8F:AE:4C:57:0E:4B:A3:00:11:22:33
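If `openssl` is not available, a similar fingerprint can be computed with the Python standard library. This is a sketch under the assumption that the server presents the same certificate that the `openssl` command above inspects. The function names `format_fingerprint` and `fetch_fingerprint` are made up for this example.

```python
import hashlib
import ssl


def format_fingerprint(der_bytes):
    """Return the SHA-256 digest of DER bytes as colon-separated uppercase hex."""
    digest = hashlib.sha256(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_fingerprint(host="localhost", port=9200):
    """Fetch the server certificate (without verifying it) and fingerprint it."""
    pem = ssl.get_server_certificate((host, port))
    return format_fingerprint(ssl.PEM_cert_to_DER_cert(pem))


# Example call against the local container:
# print(fetch_fingerprint("localhost", 9200))
```

The fingerprint is taken over the DER encoding of the certificate, which is why the PEM text is converted first.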

Now we can insert a few documents into the movies index.

from elasticsearch import Elasticsearch
from openai import OpenAI

es_client = Elasticsearch(
    "https://localhost:9200",
    ssl_assert_fingerprint='AA:BB:CC:3C:A4:99:12:A8:D6:41:B7:A6:52:ED:CA:2E:0E:64:E2:0E:A7:8F:AE:4C:57:0E:4B:A3:00:11:22:33',
    basic_auth=("elastic", "passw0rd")
)
openai_client = OpenAI(api_key='<openAI-API-key>')

movies = [
    {"title": "Inception", "genre": "Sci-Fi", "release_year": 2010},
    {"title": "The Shawshank Redemption", "genre": "Drama", "release_year": 1994},
    {"title": "The Godfather", "genre": "Crime", "release_year": 1972},
    {"title": "Pulp Fiction", "genre": "Crime", "release_year": 1994},
    {"title": "Forrest Gump", "genre": "Drama", "release_year": 1994}
]

# Indexing movies
for movie in movies:
    movie['title_embedding'] = openai_client.embeddings.create(
        input=[movie['title']], model='text-embedding-3-small'
    ).data[0].embedding
    es_client.index(index="movies", document=movie)
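The loop above makes one embedding request per movie. Since the embeddings endpoint accepts a list of inputs, all titles can be embedded in a single call. The sketch below shows that variant. `embed_titles` and `index_movies` are hypothetical helper names, and both clients are passed in as parameters so the functions do not depend on module-level state.

```python
EMBEDDING_MODEL = "text-embedding-3-small"


def embed_titles(openai_client, titles, model=EMBEDDING_MODEL):
    """Embed all titles in one request, returned in the same order as `titles`."""
    response = openai_client.embeddings.create(input=titles, model=model)
    # Each item carries the index of the input it belongs to; sort to be safe.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


def index_movies(es_client, openai_client, movies, index_name="movies"):
    """Attach an embedding to each movie and index it; return the count."""
    vectors = embed_titles(openai_client, [m["title"] for m in movies])
    for movie, vector in zip(movies, vectors):
        # Build a new dict so the caller's movie records are not mutated.
        document = {**movie, "title_embedding": vector}
        es_client.index(index=index_name, document=document)
    return len(movies)


# Example call with the clients and list defined above:
# index_movies(es_client, openai_client, movies)
```

For five movies the difference is negligible, but batching reduces round trips as the list grows.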

Let's say we want to find movies whose titles closely match Godfather. We can use k-nearest neighbors (kNN) search to find the relevant documents. We will limit the search to return only the single closest match.

First, we need the vector representation of the word Godfather:

vector_value = openai_client.embeddings.create(
    input=["Godfather"], model='text-embedding-3-small'
).data[0].embedding

Now we can search the movies index for the movies that most closely match the title Godfather. In our case, it should match the movie document with the title The Godfather.

query_string = {
    "field": "title_embedding",
    "query_vector": vector_value,
    "k": 1,
    "num_candidates": 100
}

results = es_client.search(index="movies", knn=query_string, source_includes=["title", "genre", "release_year"])

print(results['hits']['hits'])

As expected, we got the correct result!

[{
    "_index": "movies",
    "_id": "XvDTV44BCOE-aWDhxeQK",
    "_score": 0.8956262,
    "_source":
    {
        "title": "The Godfather",
        "genre": "Crime",
        "release_year": 1972
    }
}]
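To build some intuition for what the kNN search computes, here is a brute-force sketch in plain Python. It assumes the field is compared with cosine similarity and that the score is derived as (1 + cosine) / 2. Both are assumptions about the defaults and are worth checking against the documentation for your Elasticsearch version. Elasticsearch itself uses an approximate index rather than scanning every document, and the two-dimensional vectors below are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_search(query_vector, documents, k=1, field="title_embedding"):
    """Exact kNN: score every document and keep the k best (score, title) pairs."""
    scored = []
    for doc in documents:
        cosine = cosine_similarity(query_vector, doc[field])
        # Assumed score mapping for cosine similarity: (1 + cosine) / 2
        scored.append(((1 + cosine) / 2, doc["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy documents with made-up 2-dimensional vectors.
toy_docs = [
    {"title": "The Godfather", "title_embedding": [1.0, 0.0]},
    {"title": "Forrest Gump", "title_embedding": [0.0, 1.0]},
]
print(knn_search([1.0, 0.0], toy_docs, k=1))
```

Under that assumed mapping, the `_score` of 0.8956262 above would correspond to a cosine similarity of roughly 0.79 between the Godfather and The Godfather embeddings.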

Hope this helps.

Written by Sagar Gangurde