How to use Elasticsearch as Vector Database

Sagar Gangurde · Published in Data Engineering · Mar 19, 2024 · 2 min read

Let's set up a single-node Elasticsearch cluster on a local machine.

Pull the Docker images.

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker pull docker.elastic.co/kibana/kibana:8.12.2

Start the Elasticsearch and Kibana containers.

docker network create elastic

docker run -d --name elasticsearch --net elastic -p 9200:9200 -p 9300:9300 -m 1GB -e "discovery.type=single-node" -e "ELASTIC_PASSWORD=passw0rd" docker.elastic.co/elasticsearch/elasticsearch:8.12.2

docker run -d --name kibana --net elastic -p 5601:5601 docker.elastic.co/kibana/kibana:8.12.2

Verify that the containers are up and running.

docker ps

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e288f61740da docker.elastic.co/kibana/kibana:8.12.2 "/bin/tini -- /usr/l…" About an hour ago Up About an hour 0.0.0.0:5601->5601/tcp kibana
16e62f66f4e0 docker.elastic.co/elasticsearch/elasticsearch:8.12.2 "/bin/tini -- /usr/l…" About an hour ago Up About an hour 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp elasticsearch

Now, let's create the `movies` index. We will use the `text-embedding-3-small` model to generate vector embeddings for the `title` field and store them in a `title_embedding` field. This model produces embeddings of length 1536, so we need to map `title_embedding` as a `dense_vector` with 1536 dimensions.

PUT /movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "genre": {
        "type": "keyword"
      },
      "release_year": {
        "type": "integer"
      },
      "title_embedding": {
        "type": "dense_vector",
        "dims": 1536
      }
    }
  }
}
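Elasticsearch rejects any document whose vector length does not match the mapping's `dims`, so it can be worth validating embeddings before indexing. A minimal sketch (the `check_embedding` helper is hypothetical, and the all-zeros vector is just a stand-in for a real embedding):

```python
EXPECTED_DIMS = 1536  # must match "dims" in the movies mapping above

def check_embedding(vector, expected_dims=EXPECTED_DIMS):
    """Return the vector unchanged, or raise if its length is wrong."""
    if len(vector) != expected_dims:
        raise ValueError(f"expected {expected_dims} dims, got {len(vector)}")
    return vector

# A stand-in embedding (all zeros) just to exercise the check:
fake_embedding = [0.0] * 1536
check_embedding(fake_embedding)  # passes silently
```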

Let's insert a few documents using the Elasticsearch Python client.

The Python client needs an `ssl_assert_fingerprint` to connect to Elasticsearch over TLS. Let's get it with the following command:

openssl s_client -connect localhost:9200 -servername localhost -showcerts </dev/null 2>/dev/null \
  | openssl x509 -fingerprint -sha256 -noout -in /dev/stdin

sha256 Fingerprint=AA:BB:CC:3C:A4:99:12:A8:D6:41:B7:A6:52:ED:CA:2E:0E:64:E2:0E:A7:8F:AE:4C:57:0E:4B:A3:00:11:22:33

Now we can insert a few documents into the `movies` index.

from elasticsearch import Elasticsearch
from openai import OpenAI

es_client = Elasticsearch(
    "https://localhost:9200",
    ssl_assert_fingerprint='AA:BB:CC:3C:A4:99:12:A8:D6:41:B7:A6:52:ED:CA:2E:0E:64:E2:0E:A7:8F:AE:4C:57:0E:4B:A3:00:11:22:33',
    basic_auth=("elastic", "passw0rd")
)
openai_client = OpenAI(api_key='<openAI-API-key>')

movies = [
    {"title": "Inception", "genre": "Sci-Fi", "release_year": 2010},
    {"title": "The Shawshank Redemption", "genre": "Drama", "release_year": 1994},
    {"title": "The Godfather", "genre": "Crime", "release_year": 1972},
    {"title": "Pulp Fiction", "genre": "Crime", "release_year": 1994},
    {"title": "Forrest Gump", "genre": "Drama", "release_year": 1994}
]

# Index each movie along with an embedding of its title
for movie in movies:
    movie['title_embedding'] = openai_client.embeddings.create(
        input=[movie['title']], model='text-embedding-3-small'
    ).data[0].embedding
    es_client.index(index="movies", document=movie)
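Indexing one document per call is fine for this demo; for larger datasets the Python client's `helpers.bulk` is the usual route. A sketch of the action format it expects — `embed` here is a stand-in for the OpenAI call above, and the `helpers.bulk` call itself is left as a comment since it needs a running cluster:

```python
def embed(text):
    # Stand-in for openai_client.embeddings.create(...).data[0].embedding
    return [0.0] * 1536

movies = [
    {"title": "Inception", "genre": "Sci-Fi", "release_year": 2010},
    {"title": "The Godfather", "genre": "Crime", "release_year": 1972},
]

# One action dict per document, in the shape helpers.bulk expects:
actions = [
    {"_index": "movies",
     "_source": {**movie, "title_embedding": embed(movie["title"])}}
    for movie in movies
]

# from elasticsearch import helpers
# helpers.bulk(es_client, actions)  # requires a live cluster
```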

Let's say we want to find movies whose title closely matches "Godfather". We can use k-nearest neighbors (kNN) search to retrieve relevant documents. We will limit the search to the single closest match.

First, we need to get the vector representation of the word "Godfather":

vector_value = openai_client.embeddings.create(
    input=["Godfather"], model='text-embedding-3-small'
).data[0].embedding

Now we can search the `movies` index for titles that closely match "Godfather". In our case, it should match the document with the title "The Godfather". Here `k` is the number of neighbors to return, and `num_candidates` controls how many candidates each shard considers before the top `k` are picked.

query_string = {
    "field": "title_embedding",
    "query_vector": vector_value,
    "k": 1,
    "num_candidates": 100
}

results = es_client.search(index="movies", knn=query_string, source_includes=["title", "genre", "release_year"])

print(results['hits']['hits'])

As expected, we got the correct result!

[{
  "_index": "movies",
  "_id": "XvDTV44BCOE-aWDhxeQK",
  "_score": 0.8956262,
  "_source": {
    "title": "The Godfather",
    "genre": "Crime",
    "release_year": 1972
  }
}]
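Where does the `_score` come from? For `dense_vector` fields using cosine similarity (the default), Elasticsearch maps the raw cosine into `(1 + cosine) / 2` so scores stay between 0 and 1. A small stdlib-only sketch with made-up 3-dimensional vectors standing in for real 1536-dimensional embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def es_knn_score(query_vec, doc_vec):
    # Elasticsearch's _score for cosine similarity: (1 + cosine) / 2
    return (1 + cosine(query_vec, doc_vec)) / 2

# Toy vectors standing in for real embeddings:
query = [0.1, 0.9, 0.2]
close_doc = [0.15, 0.85, 0.25]   # similar direction -> score near 1.0
far_doc = [0.9, -0.1, 0.1]       # different direction -> lower score

assert es_knn_score(query, close_doc) > es_knn_score(query, far_doc)
```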

Hope this helps.
