The Retrieval Part of RAG with Elasticsearch for Non-English Languages

Pongsasit Thongpramoon
6 min read · Jan 26, 2024


Agenda

  1. Start Elasticsearch with Docker locally / Elastic Cloud free trial
  2. Play with multiple vector embedders
    Option 1: Host the embedder yourself
    Option 2: With Elastic Cloud
  3. Create the index: settings, mappings, and indexing
  4. Search options
  5. RAG improvement for languages without a re-ranker (e.g., Thai)

Start Elasticsearch with Docker locally / Elastic Cloud free trial

  1. Local environment setup

reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

step1: Make sure the Docker engine is already running.
step2: Create a Docker network

docker network create elastic

step3: Pull the elasticsearch image

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.0

step4: Run the elasticsearch container

docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:8.12.0
  • When this command finishes, it prints the elastic user password and an enrollment token for Kibana.

step5: Set up the credentials
Use the password from step4's output ("your_password"):

export ELASTIC_PASSWORD="your_password"

Get cert file

docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .

Test the setup
curl --cacert http_ca.crt -u elastic:$ELASTIC_PASSWORD https://localhost:9200

  2. If you want to use eland and Elastic ML nodes, use Elastic Cloud instead (14-day free trial).

https://www.elastic.co/guide/en/machine-learning/current/setup.html

Play with multiple vector embedders

WHY Elasticsearch instead of Milvus?

  1. If your input/output is English, a re-ranker model makes retrieval in RAG not that difficult.
  2. But what if your users are non-English speakers/readers? If a re-ranker model exists for your language, that is good. If not, combining the scores of multiple methodologies becomes important.
  3. Elasticsearch has fuzzy-search logic (applying an analyzer and tokenizer for your language and using an n-gram methodology to generate a score, which you can later boost) that can be mixed with semantic search.
  4. As of now, Milvus supports only one vector per collection, so if you develop RAG over non-English or multilingual documents, you may need more than one collection for your languages. Elasticsearch 8.12, however, already supports multiple vector fields in one index!
  5. ref: https://github.com/milvus-io/milvus/issues/25639

Option 1: Host the embedder yourself

If you use Docker to host this environment, you can build a system with this architecture:

FE (frontend)[Pass text input to backend] →
BE(Backend) [Process text2vec →search payload to Elasticsearch] →
ES (elasticsearch) [Do search]

Pros: Cost-effective, since we do not need a dedicated embedder model server (the backend embeds the text itself).
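The Option 1 flow above can be sketched as follows. This is a minimal sketch, assuming the `client`, embedder, and index name used later in this post; the backend does text2vec itself, then sends a kNN payload to Elasticsearch.

```python
def build_knn_query(field, query_vector, k=4, num_candidates=20):
    # The kNN search payload the backend (BE) sends to Elasticsearch (ES)
    return {
        "field": field,
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }

# In the backend:
#   vector = english_embedder.encode(user_text)          # text2vec in BE
#   client.search(index="search_character_test",
#                 knn=build_knn_query("vector_en", vector))
```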

Option 2: With an Elastic Cloud ML node (available on IBM Cloud: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch)

FE (frontend)[Pass text input to backend] →
BE (Backend) [Put the text into the search payload, then pass it to the Elasticsearch ML node] →
ES (Elasticsearch) [Get text → embed → do search]

Pros: The backend microservice can be smaller, and you take advantage of the ML node when many search requests come in.
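With an ML node, the backend does not embed at all: Elasticsearch (8.7+) can embed the query text server-side via `query_vector_builder`. A sketch under assumptions — the `model_id` below is hypothetical and must match a model you have actually deployed to your ML node (e.g., with eland):

```python
def build_ml_knn_query(field, model_id, text, k=4, num_candidates=20):
    # Elasticsearch embeds `text` on the ML node before running the kNN search
    return {
        "field": field,
        "k": k,
        "num_candidates": num_candidates,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": model_id,   # assumed: id of a model deployed via eland
                "model_text": text,
            }
        },
    }

payload = build_ml_knn_query("vector_en",
                             "baai__bge-large-en-v1.5",  # hypothetical model id
                             "Who survived the death curse of Lord Voldemort")
# client.search(index="search_character_test", knn=payload)
```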

Create the index: settings, mappings, and indexing

Before we talk about the index, settings, mappings, and indexing, let me quickly introduce the terms.

Elasticsearch has: Node → Shard → Index → Field. Node and Shard are infrastructure, so we will skip them in this blog.

  • Index: where you keep semi-structured data. (For structured data you would use a SQL table, where each table has a schema of columns.)
  • Field: a key of the semi-structured data you keep. (Think of it as a column of a SQL table in the structured-data scenario.)
  • Settings: Elasticsearch is not only a database but also a SEARCH engine; the settings configure how it searches (n-grams, analyzer, tokenizer).
  • Mappings: the data type of each field, plus where your settings are applied.

In this blog I will show how to search for something related to characters from Harry Potter, Naruto, and One Piece.

The data is a CSV of character names and descriptions in both English and Thai.

  • For local Elasticsearch using Docker:
from elasticsearch import Elasticsearch

client = Elasticsearch(
    hosts=["https://localhost:9200"],
    basic_auth=('USERNAME', 'PASSWORD'),  # http_auth is deprecated in the 8.x client
    ca_certs="./http_ca.crt",
)
  • For Elastic Cloud:
from elasticsearch import Elasticsearch

client = Elasticsearch(
    cloud_id='ELASTIC_CLOUD_ID',
    api_key='ELASTIC_API_KEY',
    request_timeout=600,
)
  • Then prepare the settings and mappings for the index. (Note: icu_tokenizer requires the analysis-icu plugin to be installed on your Elasticsearch nodes.)
setting_mapping = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "analyzer_shingle": {
                        "tokenizer": "icu_tokenizer",
                        "filter": ["filter_shingle"]
                    }
                },
                "filter": {
                    "filter_shingle": {
                        "type": "shingle",
                        "max_shingle_size": 3,
                        "min_shingle_size": 2,
                        "output_unigrams": "true"
                    }
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "th_character": {"type": "text"},
            "th_character_description": {"type": "text", "analyzer": "analyzer_shingle"},
            "en_character": {"type": "text"},
            "en_character_description": {"type": "text", "analyzer": "analyzer_shingle"},
            "vector_en": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine"
            },
            "vector_th": {
                "type": "dense_vector",
                "dims": 768,
                "similarity": "cosine"
            }
        }
    }
}
  • Then create the index:
client.indices.create(index='search_character_test', body=setting_mapping)
  • Then upload the data!

This differs between Elastic Cloud and your local Docker container. For Elastic Cloud, please follow the references here: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch
and
https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/04-multilingual.ipynb

For the docker local version please continue reading this blog post.

  • Then prepare the data:
import pandas as pd
from sentence_transformers import SentenceTransformer, models

def get_model(model_name, max_seq_length=768):
    word_embedding_model = models.Transformer(model_name, max_seq_length=max_seq_length)
    # We use the [CLS] token as the sentence representation
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                                   pooling_mode='cls')
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])

english_embedder = get_model('BAAI/bge-large-en-v1.5', max_seq_length=768)
thai_embedder = get_model('airesearch/wangchanberta-base-att-spm-uncased', max_seq_length=768)

data_read = pd.read_csv('./mock_data/character_data.csv')  # the data shown previously
data_set = data_read.to_dict('records')

for data in data_set:
    data['vector_en'] = english_embedder.encode(data['en_character_description'])
    data['vector_th'] = thai_embedder.encode(data['th_character_description'])
    client.index(index='search_character_test', document=data)
client.indices.refresh()
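Indexing one document per request works for this small demo; for larger datasets the bulk API is much faster. A minimal sketch, assuming the same `client` and `data_set` as above (`helpers.bulk` comes from the official elasticsearch Python package):

```python
def to_bulk_actions(index_name, docs):
    # One bulk action per document; _source carries the document body
    return [{"_index": index_name, "_source": doc} for doc in docs]

# from elasticsearch import helpers
# helpers.bulk(client, to_bulk_actions('search_character_test', data_set))
# client.indices.refresh(index='search_character_test')
```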

Now your index is ready for search (the R in RAG)!

Search options

Elasticsearch offers both semantic search (vector search) and fuzzy logic (n-gram word similarity).

As you can see, in the settings we already applied an analyzer and tokenizer to our index, and we have vector fields in our mapping.

Warm-up:

# Get all data
resp = client.search(index='search_character_test', query={"match_all": {}})
for hit in resp['hits']['hits']:
    print(hit['_source']['th_character'])

Start with the fuzzy search.

th_question = 'ใครที่รอดจากคำสาปของลอร์ดโวลเดอมอร์'  # "Who survived Lord Voldemort's curse?"
en_question = 'Who survived the death curse of Lord Voldemort'
fuzzy_en_payload = {
    "fuzzy": {
        "en_character_description": {
            "value": en_question,
            "fuzziness": "AUTO"
        }
    }
}
fuzzy_en_response = client.search(index="search_character_test", query=fuzzy_en_payload)
for hit in fuzzy_en_response['hits']['hits']:
    print(hit['_score'])
    print(hit['_source']['th_character'])

fuzzy_th_payload = {
    "fuzzy": {
        "th_character_description": {
            "value": th_question,
            "fuzziness": "AUTO"
        }
    }
}
fuzzy_th_response = client.search(index="search_character_test", query=fuzzy_th_payload)
for hit in fuzzy_th_response['hits']['hits']:
    print(hit['_score'])
    print(hit['_source']['th_character'])

From these fuzzy searches you will get a score from '_score'.

Next let’s see the semantic search.

query_vector_en = english_embedder.encode(en_question)
semantic_query_en = {
    "field": "vector_en",
    "query_vector": query_vector_en,
    "k": 4,
    "num_candidates": 20
}
semantic_resp_en = client.search(index="search_character_test", knn=semantic_query_en)
for hit in semantic_resp_en['hits']['hits']:
    print(hit['_score'])
    print(hit['_source']['th_character'])

query_vector_th = thai_embedder.encode(th_question)
semantic_query_th = {
    "field": "vector_th",
    "query_vector": query_vector_th,
    "k": 4,
    "num_candidates": 20
}
semantic_resp_th = client.search(index="search_character_test", knn=semantic_query_th)
for hit in semantic_resp_th['hits']['hits']:
    print(hit['_score'])
    print(hit['_source']['th_character'])

From these vector searches you will also get a score from '_score'.

RAG improvement for languages without a re-ranker (e.g., Thai)

If your users use English, it is OK to skip these scores because you have a re-ranker model. (REF: https://medium.com/towards-generative-ai/improving-rag-retrieval-augmented-generation-answer-quality-with-re-ranker-55a19931325)

But for Thai or other non-English languages, we may need to use traditional NLP as a score and then rank by the scores we've got (semantic plus fuzzy).

Since we have both scores, we can weight and boost them as a re-ranker (statistically or manually) based on your use-case preference.
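A minimal sketch of such a manual re-ranker: fuzzy '_score' and cosine-kNN scores live on different scales, so min-max normalize each result list before mixing. The weights below are assumptions to tune for your own use case.

```python
def normalize(scores):
    # Min-max normalize a list of scores to [0, 1]
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def combine(fuzzy_hits, knn_hits, w_fuzzy=0.3, w_knn=0.7):
    # hits: list of (doc_id, _score) pairs from each search;
    # missing docs get 0 for that method's score
    fuzzy = dict(zip([d for d, _ in fuzzy_hits],
                     normalize([s for _, s in fuzzy_hits])))
    knn = dict(zip([d for d, _ in knn_hits],
                   normalize([s for _, s in knn_hits])))
    combined = {d: w_fuzzy * fuzzy.get(d, 0.0) + w_knn * knn.get(d, 0.0)
                for d in set(fuzzy) | set(knn)}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

The weighted-sum choice here is the simplest option; you could also rank by reciprocal rank fusion if the raw score scales prove too noisy to normalize.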

To learn how to use Elasticsearch with IBM Cloud, please visit: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch
