A Quick Survey of Elasticsearch for Generative AI

Frank Chung
DeepQ Research Engineering Blog
Jul 29, 2024

Background

Elasticsearch is a distributed, non-relational database that provides multiple search methods, such as fuzzy search, search-as-you-type, and embedding (vector) search.

This post is recorded as a learning note for myself, and hopefully it helps newcomers quickly understand basic Elasticsearch from scratch.

Terminology

Assuming you know the concepts of SQL, here is the mapping between SQL-like database terms and Elasticsearch terms:

  • database → index
  • table → index (mapping types are deprecated; use one index per document type)
  • row → document
  • column → field
  • schema → mapping
  • SQL → Query DSL

Field Data Types

Elasticsearch has several built-in basic data types and composes richer types on top of them. Here are some common and useful data types.
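As a concrete illustration, a single mapping that combines several of the common types might look like the sketch below (the index and field names here are hypothetical, chosen only to show one type each):

```python
# Hypothetical mapping combining common Elasticsearch field types.
mapping = {
    "properties": {
        "title":      {"type": "text"},                # full-text search, analyzed
        "tag":        {"type": "keyword"},             # exact match, aggregations
        "created_at": {"type": "date"},
        "view_count": {"type": "integer"},
        "suggest":    {"type": "search_as_you_type"},  # prefix/autocomplete search
        "embedding":  {"type": "dense_vector", "dims": 384},  # embedding search
    }
}

# Against a running cluster this would be applied with, e.g.:
# es.indices.create(index="demo_index", mappings=mapping)
```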

Data Sharding and Replicating

To store large amounts of data in a distributed system, Elasticsearch applies data sharding and replication to guarantee performance and robustness.

Each index is split into a number of shards, and a new document is routed to a shard based on its document id. The basic idea is:

shard_id = hash(document_id) % num_shards
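The routing rule can be sketched in a few lines. Elasticsearch actually hashes a routing value (the document id by default) with Murmur3; the CRC32 below is just a stand-in stable hash for illustration:

```python
import zlib

def route_to_shard(document_id: str, num_shards: int) -> int:
    # Stand-in for Elasticsearch's Murmur3 routing hash: any stable
    # hash of the document id illustrates the idea.
    return zlib.crc32(document_id.encode("utf-8")) % num_shards

shards = [route_to_shard(f"doc-{i}", 2) for i in range(10)]
# Every document lands on a valid shard, and routing is deterministic.
assert all(0 <= s < 2 for s in shards)
assert shards == [route_to_shard(f"doc-{i}", 2) for i in range(10)]
```

Because the shard is derived from the id, the same document always goes back to the same shard, which is also why `number_of_shards` cannot be changed after index creation without reindexing.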

The point of shards is to search them in parallel: results from each shard are collected and merged into the final ranking. Conceptually, performance is therefore proportional to the number of shards. However, increasing the number of shards has side effects:

  • First, the actual search parallelism is bounded by machine capacity (cores, memory, and disk).
  • Second, network bandwidth becomes the bottleneck when per-shard results are transmitted across nodes.
  • Third, sharded search is approximate: if documents are unevenly distributed across shards, the aggregated results can have a large approximation error.

In general, it is recommended to choose the number of shards by the following rules of thumb:

  • 1 shard ≈ 20–50 GB of disk
  • 1 shard ≈ 1 CPU core
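The two rules of thumb above can be combined into a tiny sizing helper. The numbers are the guidelines quoted above, not hard limits:

```python
def suggest_num_shards(total_data_gb: float, cpu_cores: int,
                       gb_per_shard: float = 50.0) -> int:
    # Rule 1: keep each shard around 20-50 GB of disk.
    shards_for_disk = max(1, round(total_data_gb / gb_per_shard))
    # Rule 2: roughly one shard per CPU core available for search.
    return min(shards_for_disk, cpu_cores)

print(suggest_num_shards(200, 8))  # 200 GB / 50 GB per shard -> 4 shards
```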

Replicating the data improves failover and data recovery; starting from replica = 2 is recommended to balance performance and stability.
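Unlike the shard count, the replica count can be changed on a live index through the settings API. A minimal sketch, assuming an existing index named my_index:

```shell
# number_of_replicas can be updated at any time (number_of_shards cannot).
curl -X PUT "localhost:9200/my_index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 2}}'
```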

Embedding Search Implementation

Now let’s do some practice and use Elasticsearch to implement an embedding search for generative AI.

First, run the simplest possible Elasticsearch server:

docker run -p 9200:9200 -p 9300:9300 -e "xpack.security.enabled=false" -e "discovery.type=single-node" elasticsearch:8.12.1

After initialization, we can put a mapping that creates an index my_index with 2 shards and a nested field vectors to store multiple vectors (arrays of floats), and that excludes vectors.vector from _source so the raw vectors do not appear in search output:

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "_source": {
      "excludes": [
        "vectors.vector"
      ]
    },
    "properties": {
      "text": { "type": "keyword" },
      "vectors": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "dims": 384,
            "similarity": "dot_product"
          }
        }
      }
    }
  }
}
'

Then we can create embedding vectors for the strings ["apple", "book", "car"] with Sentence-BERT:

# pip install sentence-transformers elasticsearch
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

# Initialize the Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Choose a model suitable for your needs

# List of input strings
texts = ["apple", "book", "car"]

# Convert texts to unit-length vectors; the dot_product similarity
# requires vectors normalized to unit length.
vectors = [model.encode(text, normalize_embeddings=True).tolist() for text in texts]

# Initialize the Elasticsearch client
es = Elasticsearch("http://localhost:9200")  # Adjust if your setup is different

# Index each document with its vector
for i, vector in enumerate(vectors):
    doc_id = str(i + 1)  # Use IDs 1, 2, 3 for the documents
    document = {
        'text': texts[i],
        'vectors': [
            {
                'vector': vector
            }
        ]
    }
    response = es.index(index='my_index', id=doc_id, document=document)
    print(f"Indexed document with ID {doc_id}: {response}")

# Make the newly indexed documents visible to search immediately
es.indices.refresh(index='my_index')

Now we can search with another word, applepie, by encoding it and using the vector to retrieve related documents:

import json
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

# Initialize the Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Choose a model suitable for your needs

# Initialize the Elasticsearch client
es = Elasticsearch("http://localhost:9200")  # Adjust if your setup is different

# Normalize the query vector as well, to match the dot_product similarity
vector_to_search = model.encode("applepie", normalize_embeddings=True).tolist()
resp = es.search(
    index='my_index',
    knn={
        "field": "vectors.vector",
        "k": 3,
        "num_candidates": 3,
        "query_vector": vector_to_search,
    },
)
print(json.dumps(resp.body, indent=2))

Then we get the following ranked result:

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.826772,
    "hits": [
      {
        "_index": "my_index",
        "_id": "1",
        "_score": 0.826772,
        "_source": {
          "text": "apple"
        }
      },
      {
        "_index": "my_index",
        "_id": "3",
        "_score": 0.6224603,
        "_source": {
          "text": "car"
        }
      },
      {
        "_index": "my_index",
        "_id": "2",
        "_score": 0.5856434,
        "_source": {
          "text": "book"
        }
      }
    ]
  }
}

In this example, note that k in the request is the number of top candidates to return, while num_candidates is the number of candidates examined on each shard; the final ranking is produced after comparing candidates from all shards. num_candidates must be at least k, and the larger it is, the more accurate (but slower) the search.
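The interaction between k and num_candidates can be illustrated with a simplified model in plain Python: each shard only forwards its local best candidates, and only those survive into the final ranking, so a globally good document can be lost if its shard's candidate budget is too small. The shards and scores below are made up:

```python
import heapq

def knn_search(shards, k, num_candidates):
    # shards: list of lists of (doc_id, score); higher score = better.
    survivors = []
    for shard in shards:
        # Each shard forwards only its local top `num_candidates` hits.
        survivors += heapq.nlargest(num_candidates, shard, key=lambda d: d[1])
    # The final ranking sees only the per-shard survivors.
    return heapq.nlargest(k, survivors, key=lambda d: d[1])

shard_a = [("a1", 0.9), ("a2", 0.8), ("a3", 0.7)]
shard_b = [("b1", 0.6)]

# With num_candidates=1, "a2" (globally 2nd best) never leaves shard A,
# so the returned top-2 is only an approximation.
print(knn_search([shard_a, shard_b], k=2, num_candidates=1))
```

Raising num_candidates to 2 lets "a2" survive and yields the exact top-2, which is why larger values trade speed for accuracy.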

Security

DeepQ is a healthcare company with an ISO 27799 certificate, so we also care about information security. Let’s check what we can do with Elasticsearch.

  • API Authorization: use the REST API to manage API key lifecycles.
  • TLS Encryption: enable client-server and internode encryption by providing SSL keys and certificates.
  • Disk Encryption: Elasticsearch does not natively support encryption at rest; the usual method is a disk-encryption utility such as dm-crypt to encrypt the whole disk partition.
  • Field Encryption: Elasticsearch does not support an encrypt function natively, so we need to encrypt fields in the application and store them using the binary data type. Keep in mind that the binary type is not searchable; to run equality searches on a deterministically encrypted field (such as a hashed password), store the ciphertext in a keyword field instead.
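A minimal sketch of the application-side approach for the last point: deterministically transform the sensitive value before indexing (here with HMAC-SHA256 and a hypothetical application-managed key), store the digest in a keyword field, and run a term query on the same transform of the query value:

```python
import hashlib
import hmac

SECRET_KEY = b"app-managed-secret"  # hypothetical; keep it outside Elasticsearch

def equality_token(value: str) -> str:
    # Deterministic: the same input always yields the same digest,
    # so exact-match (term) queries on the keyword field still work.
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Index {"password_token": equality_token(password)} as a keyword field,
# then search with a term query on equality_token(candidate).
assert equality_token("hunter2") == equality_token("hunter2")
assert equality_token("hunter2") != equality_token("hunter3")
```

The trade-off is that only exact equality survives the transform: range queries, prefix queries, and fuzzy search on the encrypted field are no longer possible.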
