How to Use Elasticsearch as a Vector Database
Let's set up a single-node Elasticsearch cluster on a local machine.
Pull the Docker images.
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker pull docker.elastic.co/kibana/kibana:8.12.2
Start the Elasticsearch and Kibana containers.
docker network create elastic
docker run -d --name elasticsearch --net elastic -p 9200:9200 -p 9300:9300 -m 1GB -e "discovery.type=single-node" -e "ELASTIC_PASSWORD=passw0rd" docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker run -d --name kibana --net elastic -p 5601:5601 docker.elastic.co/kibana/kibana:8.12.2
Verify that the containers are up and running.
docker ps
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED             STATUS             PORTS                                            NAMES
e288f61740da   docker.elastic.co/kibana/kibana:8.12.2                 "/bin/tini -- /usr/l…"   About an hour ago   Up About an hour   0.0.0.0:5601->5601/tcp                           kibana
16e62f66f4e0   docker.elastic.co/elasticsearch/elasticsearch:8.12.2   "/bin/tini -- /usr/l…"   About an hour ago   Up About an hour   0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp   elasticsearch
Now, let's create the `movies` index. We will use the `text-embedding-3-small` model to generate vector embeddings for the `title` field and store them as `title_embedding`. This model generates embeddings of length 1536, so we need to map the `title_embedding` field as `dense_vector` with 1536 dimensions.
PUT /movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "genre": {
        "type": "keyword"
      },
      "release_year": {
        "type": "integer"
      },
      "title_embedding": {
        "type": "dense_vector",
        "dims": 1536
      }
    }
  }
}
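If you would rather create the index from Python than from the Kibana console, the same mapping can be passed to the client. This is a minimal sketch, assuming the 8.x Python client's `indices.exists` and `indices.create` methods and an `es_client` connected as in the Python snippet later in this post. The helper name `create_movies_index` is hypothetical.

```python
# Same mapping as the PUT /movies request, expressed as a Python dict.
MOVIES_MAPPINGS = {
    "properties": {
        "title": {"type": "text"},
        "genre": {"type": "keyword"},
        "release_year": {"type": "integer"},
        # Must match the embedding model's output length (1536 here).
        "title_embedding": {"type": "dense_vector", "dims": 1536},
    }
}


def create_movies_index(es_client, index_name="movies"):
    """Create the index unless it already exists (hypothetical helper).

    Returns True if the index was created, False if it was already there.
    """
    if es_client.indices.exists(index=index_name):
        return False
    es_client.indices.create(index=index_name, mappings=MOVIES_MAPPINGS)
    return True
```

Checking for the index first makes the helper safe to call again when the script is rerun.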
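Let's insert a few documents using the Elasticsearch Python client.
Elasticsearch 8 serves HTTPS with an auto-generated certificate that is not trusted by default, so the Python client needs a way to verify it. One option is `ssl_assert_fingerprint`, which takes the certificate's SHA-256 fingerprint. We can get it with the following command: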
openssl s_client -connect localhost:9200 -servername localhost -showcerts </dev/null 2>/dev/null \
| openssl x509 -fingerprint -sha256 -noout -in /dev/stdin
sha256 Fingerprint=AA:BB:CC:3C:A4:99:12:A8:D6:41:B7:A6:52:ED:CA:2E:0E:64:E2:0E:A7:8F:AE:4C:57:0E:4B:A3:00:11:22:33
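The same fingerprint can also be computed from Python with the standard library, which is handy if `openssl` is not installed. This is a sketch under assumptions: `format_fingerprint` and `fetch_fingerprint` are hypothetical helper names, and `ssl.get_server_certificate` fetches the certificate without validating it, so only use it against a server you trust, such as the local container.

```python
import hashlib
import ssl


def format_fingerprint(der_bytes):
    """Return the SHA-256 digest of DER bytes as colon-separated uppercase hex."""
    digest = hashlib.sha256(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_fingerprint(host="localhost", port=9200):
    """Fetch the server certificate (unvalidated) and return its fingerprint."""
    pem = ssl.get_server_certificate((host, port))
    return format_fingerprint(ssl.PEM_cert_to_DER_cert(pem))


# Usage, with the Elasticsearch container running:
# print(fetch_fingerprint())
```

The value returned by `fetch_fingerprint()` can be passed directly as `ssl_assert_fingerprint`.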
Now we can insert a few documents into the `movies` index.
from elasticsearch import Elasticsearch
from openai import OpenAI

# Connect over HTTPS, verifying the server certificate by its SHA-256 fingerprint.
es_client = Elasticsearch(
    "https://localhost:9200",
    ssl_assert_fingerprint='AA:BB:CC:3C:A4:99:12:A8:D6:41:B7:A6:52:ED:CA:2E:0E:64:E2:0E:A7:8F:AE:4C:57:0E:4B:A3:00:11:22:33',
    basic_auth=("elastic", "passw0rd")
)
openai_client = OpenAI(api_key='<openAI-API-key>')

movies = [
    {"title": "Inception", "genre": "Sci-Fi", "release_year": 2010},
    {"title": "The Shawshank Redemption", "genre": "Drama", "release_year": 1994},
    {"title": "The Godfather", "genre": "Crime", "release_year": 1972},
    {"title": "Pulp Fiction", "genre": "Crime", "release_year": 1994},
    {"title": "Forrest Gump", "genre": "Drama", "release_year": 1994}
]

# Index each movie together with the embedding of its title.
for movie in movies:
    movie['title_embedding'] = openai_client.embeddings.create(
        input=[movie['title']], model='text-embedding-3-small'
    ).data[0].embedding
    es_client.index(index="movies", document=movie)
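The indexing loop sends one embedding request and one index request per movie. For larger datasets, a common alternative is to embed all titles in one request and send the documents through the client's bulk helper. This is a minimal sketch, assuming `elasticsearch.helpers.bulk` from the 8.x client and list input to the OpenAI embeddings endpoint. `embed_titles` and `build_bulk_actions` are hypothetical helper names.

```python
def embed_titles(openai_client, titles, model="text-embedding-3-small"):
    """Embed all titles in one request and return vectors in input order."""
    response = openai_client.embeddings.create(input=titles, model=model)
    # Each item carries the position of its input; sort on it to be explicit.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


def build_bulk_actions(movies, embeddings, index_name="movies"):
    """Yield one bulk action per movie, pairing each with its embedding."""
    for movie, embedding in zip(movies, embeddings):
        yield {
            "_index": index_name,
            "_source": {**movie, "title_embedding": embedding},
        }


# Usage (requires the clients created earlier):
# from elasticsearch import helpers
# embeddings = embed_titles(openai_client, [m["title"] for m in movies])
# helpers.bulk(es_client, build_bulk_actions(movies, embeddings))
# es_client.indices.refresh(index="movies")  # make new documents searchable right away
```

The explicit refresh matters mainly in scripts that search immediately after indexing, because newly indexed documents only become searchable after the next refresh.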
Let's say we want to search for movies whose titles closely match `Godfather`. We can use k-nearest neighbors (kNN) search to find the relevant documents. We will limit the search to return only the single closest match.
First, we need the vector representation of the word `Godfather`:
vector_value = openai_client.embeddings.create(
    input=["Godfather"], model='text-embedding-3-small'
).data[0].embedding
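Conceptually, a kNN search compares the query vector with the stored vectors and returns the `k` most similar ones. The sketch below shows an exact, brute-force version of that idea using cosine similarity. It is an illustration only: the 3-dimensional vectors are made up, and Elasticsearch's `knn` search relies on an approximate HNSW index rather than scanning every document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query, vectors, k=1):
    """Return the k (name, similarity) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up toy vectors standing in for real 1536-dimensional embeddings.
toy_vectors = {
    "The Godfather": [0.9, 0.1, 0.0],
    "Inception": [0.1, 0.9, 0.2],
    "Forrest Gump": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

nearest = brute_force_knn(toy_query, toy_vectors, k=1)
print(nearest[0][0])  # The Godfather
```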
Now we can search the `movies` index for the movies whose titles most closely match `Godfather`. In our case, it should match the document with the title `The Godfather`.
query_string = {
    "field": "title_embedding",
    "query_vector": vector_value,
    "k": 1,                 # number of nearest neighbors to return
    "num_candidates": 100   # candidates considered per shard
}
results = es_client.search(
    index="movies",
    knn=query_string,
    source_includes=["title", "genre", "release_year"]
)
print(results['hits']['hits'])
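The `knn` option also accepts a `filter`, which restricts the candidates to documents matching a query. As a hedged sketch, the hypothetical helper `build_knn_query` below builds the same kNN dict and optionally narrows it by `genre`. The field names come from the mapping above, and the `k <= num_candidates` check mirrors a constraint Elasticsearch enforces.

```python
def build_knn_query(query_vector, k=1, num_candidates=100, genre=None):
    """Build the dict passed as `knn=` to `es_client.search` (hypothetical helper)."""
    if k > num_candidates:
        raise ValueError("num_candidates must be greater than or equal to k")
    query = {
        "field": "title_embedding",
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }
    if genre is not None:
        # `genre` is mapped as `keyword`, so an exact-match term filter applies.
        query["filter"] = {"term": {"genre": genre}}
    return query


# Usage (requires the client and query vector created earlier):
# results = es_client.search(
#     index="movies",
#     knn=build_knn_query(vector_value, genre="Crime"),
#     source_includes=["title", "genre", "release_year"],
# )
```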
As expected, we got the correct result!
[{
    "_index": "movies",
    "_id": "XvDTV44BCOE-aWDhxeQK",
    "_score": 0.8956262,
    "_source": {
        "title": "The Godfather",
        "genre": "Crime",
        "release_year": 1972
    }
}]
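A note on reading `_score`: when a `dense_vector` field uses cosine similarity, Elasticsearch reports the score as `(1 + cosine) / 2`, so it always falls between 0 and 1. We did not set `similarity` in the mapping, so the sketch below assumes cosine is the default in effect for this version. `score_to_cosine` is a hypothetical helper that inverts the formula.

```python
def score_to_cosine(score):
    """Invert _score = (1 + cosine) / 2 to recover the raw cosine similarity."""
    return 2 * score - 1


print(round(score_to_cosine(0.8956262), 4))  # 0.7913
```

Under that assumption, the score of 0.8956262 corresponds to a cosine similarity of about 0.79 between the embeddings of `Godfather` and `The Godfather`.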
Hope this helps.