The Retrieval Part of RAG with Elasticsearch for Non-English Languages

Pongsasit Thongpramoon
6 min read · Jan 26, 2024


Agenda

  1. Start Elasticsearch with Docker locally / Elastic Cloud free trial
  2. Play with multiple vector embedders
    Option 1: Host the embedder yourself
    Option 2: With Elastic Cloud
  3. Create the index: settings, mappings, and indexing
  4. Search options
  5. RAG improvement for languages without a re-ranker (e.g., Thai)

Start Elasticsearch with Docker locally / Elastic Cloud free trial

  1. Local environment setup

reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

step1: Make sure the Docker engine is already running.
step2: Create a Docker network

docker network create elastic

step3: Pull the elasticsearch image

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.0

step4: Run the elasticsearch container

docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:8.12.0
  • When this command finishes, it prints the elastic user password and an enrollment token for Kibana.

step5: Set up the credentials
Use the password from step4's output ("your_password"):

export ELASTIC_PASSWORD="your_password"

Get cert file

docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .

Test the setup
curl --cacert http_ca.crt -u elastic:$ELASTIC_PASSWORD https://localhost:9200

  2. If you want to use eland and Elastic ML nodes, use Elastic Cloud instead (14-day free trial).

https://www.elastic.co/guide/en/machine-learning/current/setup.html

Play with multiple vector embedders

WHY Elasticsearch instead of Milvus?

  1. If your input/output is English, a re-ranker model makes retrieval in RAG not that difficult.
  2. But what if your users are non-English speakers/readers? If a re-ranker model exists for your language, that is good. If not, combining the scores of multiple methodologies becomes important.
  3. Elasticsearch has fuzzy-search logic (applying an analyzer and tokenizer for your language and using an n-gram methodology to generate a score, which you can later boost) that can be mixed with semantic search.
  4. As of now, Milvus supports only one vector per collection, so if you develop RAG over non-English or multilingual documents, you may need more than one collection for your languages. Elasticsearch 8.12, however, already supports multiple vector fields in one index!
  5. ref: https://github.com/milvus-io/milvus/issues/25639

Option 1: Host the embedder yourself

If you use Docker to host this environment, you can build a system with this architecture:

FE (frontend)[Pass text input to backend] →
BE(Backend) [Process text2vec →search payload to Elasticsearch] →
ES (elasticsearch) [Do search]

Pros: Cost-effective, since we do not need a dedicated embedder model server (the backend embeds the text itself).
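The Option 1 flow above can be sketched as follows. This is a minimal sketch, assuming the `client`, embedder, and index name used later in this post; the backend does text2vec itself, then sends a kNN payload to Elasticsearch.

```python
def build_knn_query(field, query_vector, k=4, num_candidates=20):
    # The kNN search payload the backend (BE) sends to Elasticsearch (ES)
    return {
        "field": field,
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }

# In the backend:
#   vector = english_embedder.encode(user_text)          # text2vec in BE
#   client.search(index="search_character_test",
#                 knn=build_knn_query("vector_en", vector))
```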

Option 2: With an Elastic Cloud ML node (available on IBM Cloud: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch)

FE (frontend)[Pass text input to backend] →
BE (Backend) [Put the text into the search payload, then pass it to the Elasticsearch ML node] →
ES (Elasticsearch) [Get text → embed → do search]

Pros: The backend microservice can be smaller, and you take advantage of the ML node when many search requests come in.
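With an ML node, the backend does not embed at all: Elasticsearch (8.7+) can embed the query text server-side via `query_vector_builder`. A sketch under assumptions — the `model_id` below is hypothetical and must match a model you have actually deployed to your ML node (e.g., with eland):

```python
def build_ml_knn_query(field, model_id, text, k=4, num_candidates=20):
    # Elasticsearch embeds `text` on the ML node before running the kNN search
    return {
        "field": field,
        "k": k,
        "num_candidates": num_candidates,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": model_id,   # assumed: id of a model deployed via eland
                "model_text": text,
            }
        },
    }

payload = build_ml_knn_query("vector_en",
                             "baai__bge-large-en-v1.5",  # hypothetical model id
                             "Who survived the death curse of Lord Voldemort")
# client.search(index="search_character_test", knn=payload)
```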

Create the index: settings, mappings, and indexing

Before we talk about the index, settings, mappings, and indexing, let me quickly introduce the terms.

Elasticsearch has: Node → Shard → Index → Field. Node and Shard are infrastructure, so we will skip them in this blog.

  • Index: where you keep semi-structured data. (For structured data you would use a SQL table, where each table has a schema of columns.)
  • Field: a key of the semi-structured data you keep. (Think of it as a column of a SQL table in the structured-data scenario.)
  • Settings: Elasticsearch is not only a database but also a SEARCH engine; the settings configure how it searches (n-grams, analyzer, tokenizer).
  • Mappings: the data type of each field, plus where your settings are applied.

In this blog I will show how to search for something related to characters from Harry Potter, Naruto, and One Piece.

The data is a CSV of character names and descriptions in both English and Thai.

  • For local Elasticsearch using Docker:
from elasticsearch import Elasticsearch

client = Elasticsearch(
    hosts=["https://localhost:9200"],
    basic_auth=('USERNAME', 'PASSWORD'),  # http_auth is deprecated in the 8.x client
    ca_certs="./http_ca.crt",
)
  • For Elastic Cloud:
from elasticsearch import Elasticsearch

client = Elasticsearch(
    cloud_id='ELASTIC_CLOUD_ID',
    api_key='ELASTIC_API_KEY',
    request_timeout=600,
)
  • Then prepare the settings and mappings for the index. (Note: icu_tokenizer requires the analysis-icu plugin to be installed on your Elasticsearch nodes.)
setting_mapping = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "analyzer_shingle": {
                        "tokenizer": "icu_tokenizer",
                        "filter": ["filter_shingle"]
                    }
                },
                "filter": {
                    "filter_shingle": {
                        "type": "shingle",
                        "max_shingle_size": 3,
                        "min_shingle_size": 2,
                        "output_unigrams": "true"
                    }
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "th_character": {"type": "text"},
            "th_character_description": {"type": "text", "analyzer": "analyzer_shingle"},
            "en_character": {"type": "text"},
            "en_character_description": {"type": "text", "analyzer": "analyzer_shingle"},
            "vector_en": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine"
            },
            "vector_th": {
                "type": "dense_vector",
                "dims": 768,
                "similarity": "cosine"
            }
        }
    }
}
  • Then create the index:
client.indices.create(index='search_character_test', body=setting_mapping)
  • Then upload the data!

This differs between Elastic Cloud and your local Docker container. For Elastic Cloud, please follow the references here: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch
and
https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/04-multilingual.ipynb

For the docker local version please continue reading this blog post.

  • Then prepare the data:
import pandas as pd
from sentence_transformers import SentenceTransformer, models

def get_model(model_name, max_seq_length=768):
    word_embedding_model = models.Transformer(model_name, max_seq_length=max_seq_length)
    # We use the [CLS] token as the sentence representation
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                                   pooling_mode='cls')
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])

english_embedder = get_model('BAAI/bge-large-en-v1.5', max_seq_length=768)
thai_embedder = get_model('airesearch/wangchanberta-base-att-spm-uncased', max_seq_length=768)

data_read = pd.read_csv('./mock_data/character_data.csv')  # the data shown previously
data_set = data_read.to_dict('records')

for data in data_set:
    data['vector_en'] = english_embedder.encode(data['en_character_description'])
    data['vector_th'] = thai_embedder.encode(data['th_character_description'])
    client.index(index='search_character_test', document=data)
client.indices.refresh()
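Indexing one document per request works for this small demo; for larger datasets the bulk API is much faster. A minimal sketch, assuming the same `client` and `data_set` as above (`helpers.bulk` comes from the official elasticsearch Python package):

```python
def to_bulk_actions(index_name, docs):
    # One bulk action per document; _source carries the document body
    return [{"_index": index_name, "_source": doc} for doc in docs]

# from elasticsearch import helpers
# helpers.bulk(client, to_bulk_actions('search_character_test', data_set))
# client.indices.refresh(index='search_character_test')
```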

Now your index is ready for search (the R in RAG)!

Search options

Elasticsearch offers both semantic search (vector search) and fuzzy logic (n-gram word similarity).

As you can see, in the settings we already applied an analyzer and tokenizer to our index, and we have vector fields in our mapping.

Warm-up:

# Get all data
resp = client.search(index='search_character_test', query={"match_all": {}})
for hit in resp['hits']['hits']:
    print(hit['_source']['th_character'])

Start with the fuzzy search.

th_question = 'ใครที่รอดจากคำสาปของลอร์ดโวลเดอมอร์'  # "Who survived Lord Voldemort's curse?"
en_question = 'Who survived the death curse of Lord Voldemort'
fuzzy_en_payload = {
    "fuzzy": {
        "en_character_description": {
            "value": en_question,
            "fuzziness": "AUTO"
        }
    }
}
fuzzy_en_response = client.search(index="search_character_test", query=fuzzy_en_payload)
for hit in fuzzy_en_response['hits']['hits']:
    print(hit['_score'])
    print(hit['_source']['th_character'])

fuzzy_th_payload = {
    "fuzzy": {
        "th_character_description": {
            "value": th_question,
            "fuzziness": "AUTO"
        }
    }
}
fuzzy_th_response = client.search(index="search_character_test", query=fuzzy_th_payload)
for hit in fuzzy_th_response['hits']['hits']:
    print(hit['_score'])
    print(hit['_source']['th_character'])

From these fuzzy searches you will get a score from '_score'.

Next let’s see the semantic search.

query_vector_en = english_embedder.encode(en_question)
semantic_query_en = {
    "field": "vector_en",
    "query_vector": query_vector_en,
    "k": 4,
    "num_candidates": 20
}
semantic_resp_en = client.search(index="search_character_test", knn=semantic_query_en)
for hit in semantic_resp_en['hits']['hits']:
    print(hit['_score'])
    print(hit['_source']['th_character'])

query_vector_th = thai_embedder.encode(th_question)
semantic_query_th = {
    "field": "vector_th",
    "query_vector": query_vector_th,
    "k": 4,
    "num_candidates": 20
}
semantic_resp_th = client.search(index="search_character_test", knn=semantic_query_th)
for hit in semantic_resp_th['hits']['hits']:
    print(hit['_score'])
    print(hit['_source']['th_character'])

From these vector searches you will also get a score from '_score'.

RAG improvement for languages without a re-ranker (e.g., Thai)

If your users use English, it is OK to skip these scores because you have a re-ranker model. (REF: https://medium.com/towards-generative-ai/improving-rag-retrieval-augmented-generation-answer-quality-with-re-ranker-55a19931325)

But for Thai or other non-English languages, we may need to use traditional NLP as a score and then rank by the scores we've got (semantic plus fuzzy).

Since we have both scores, we can weight and boost them as a re-ranker (statistically or manually) based on your use-case preference.
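A minimal sketch of such a manual re-ranker: fuzzy '_score' and cosine-kNN scores live on different scales, so min-max normalize each result list before mixing. The weights below are assumptions to tune for your own use case.

```python
def normalize(scores):
    # Min-max normalize a list of scores to [0, 1]
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def combine(fuzzy_hits, knn_hits, w_fuzzy=0.3, w_knn=0.7):
    # hits: list of (doc_id, _score) pairs from each search;
    # missing docs get 0 for that method's score
    fuzzy = dict(zip([d for d, _ in fuzzy_hits],
                     normalize([s for _, s in fuzzy_hits])))
    knn = dict(zip([d for d, _ in knn_hits],
                   normalize([s for _, s in knn_hits])))
    combined = {d: w_fuzzy * fuzzy.get(d, 0.0) + w_knn * knn.get(d, 0.0)
                for d in set(fuzzy) | set(knn)}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

The weighted-sum choice here is the simplest option; you could also rank by reciprocal rank fusion if the raw score scales prove too noisy to normalize.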

To learn how to use Elasticsearch with IBM Cloud, please visit: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch
