The Retrieval Part of RAG with Elasticsearch for Non-English Languages
Agenda
- Start Elasticsearch with Docker locally / Elastic Cloud free trial
- Play with multiple vector embedders
  - Option 1: Host the embedder yourself
  - Option 2: With Elastic Cloud (create index, settings, mapping, and indexing)
- Search options
- RAG improvement for languages that have no re-ranker (Thai)
Start Elasticsearch with Docker locally / Elastic Cloud free trial
- Local environment setup
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
step1: Make sure the Docker engine is already running.
step2: create docker network
docker network create elastic
step3: Pull the elasticsearch image
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.0
step4: Run the elasticsearch container
docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:8.12.0
When this command finishes, it prints the password for the elastic user and an enrollment token for Kibana.
step5: Set up the credentials
Use the password from step4's output ("your_password"):
export ELASTIC_PASSWORD="your_password"
Get the certificate file:
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
Test the setup:
curl --cacert http_ca.crt -u elastic:$ELASTIC_PASSWORD https://localhost:9200
- If you want to use eland and Elastic ML nodes, please use Elastic Cloud instead (free 14-day trial).
→ https://www.elastic.co/guide/en/machine-learning/current/setup.html
Play with multiple vector embedders
WHY Elasticsearch instead of Milvus?
- If you use English as input/output, a re-ranker model can be applied, which makes retrieval in RAG not that difficult.
- But what if your users are non-English speakers/readers? If a re-ranker model exists for your language, that is good. If not, combining the scores of multiple methodologies becomes important.
- Elasticsearch has fuzzy search logic (apply an analyzer and tokenizer for your language, use an n-gram methodology to generate a score, then boost that score later) that you can mix with semantic search.
- As of now, Milvus supports only one vector per collection, so if you build RAG over non-English or multilingual documents, you may need more than one collection for the languages. Elasticsearch 8.12, however, already supports multiple vector fields in one index!
- ref: https://github.com/milvus-io/milvus/issues/25639
Option 1: Host the embedder yourself
If you use Docker to host this environment, you can build the system with this architecture:
FE (frontend)[Pass text input to backend] →
BE(Backend) [Process text2vec →search payload to Elasticsearch] →
ES (elasticsearch) [Do search]
Pros: Cost-effective, since we do not have to host a separate embedder model server.
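As a minimal sketch of the Option 1 flow (the helper name and the dummy embedder are my own; a real BE would call something like SentenceTransformer.encode), the backend's text2vec plus payload-building step could look like:

```python
# Sketch of the Option 1 backend step: embed the query text ourselves,
# then build a kNN search payload for Elasticsearch.
# `embed` is a stand-in for a real embedder, e.g. SentenceTransformer.encode.
def build_knn_payload(text, embed, field="vector_en", k=4, num_candidates=20):
    return {
        "field": field,
        "query_vector": embed(text),   # the BE does text2vec itself
        "k": k,
        "num_candidates": num_candidates,
    }

# Usage with a dummy embedder (a real one would return e.g. a 1024-dim vector):
payload = build_knn_payload("Who survived the curse?", lambda t: [0.0] * 4)
# The payload is then sent as: client.search(index=..., knn=payload)
```

The point of the sketch is that in Option 1 the vector is computed before the request ever reaches Elasticsearch.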
Option 2: With an Elastic Cloud ML node (available on IBM Cloud: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch)
FE (frontend)[Pass text input to backend] →
BE(Backend) [Put the text into search payload then pass to Elasticsearch ML node] →
ES (elasticsearch) [Get text→embed→Dosearch]
Pros: The BE microservice can be smaller, and you take advantage of the ML node when many search requests come in.
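With an ML node, the embedding happens inside Elasticsearch, so the BE ships only text. A sketch of such a request, assuming an embedding model has already been deployed to the ML node under the hypothetical id my-embedding-model (Elasticsearch's query_vector_builder, available since 8.7, handles the text-to-vector step server-side):

```python
# Option 2 sketch: the BE sends raw text; the ES ML node runs the deployed
# embedding model via `query_vector_builder` (Elasticsearch 8.7+).
def build_ml_node_knn_payload(text, model_id="my-embedding-model",
                              field="vector_en", k=4, num_candidates=20):
    return {
        "field": field,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": model_id,   # model deployed on the ML node
                "model_text": text,     # ES embeds this text server-side
            }
        },
        "k": k,
        "num_candidates": num_candidates,
    }

payload = build_ml_node_knn_payload("Who survived the curse?")
# Then: client.search(index=..., knn=payload) — no embedder in the BE at all.
```

Note that the BE never touches a vector here; that is exactly the trade-off against Option 1.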
Create index, settings, mapping, and indexing
Before we talk about index, settings, mapping, and indexing, let me quickly introduce the terms.
Elasticsearch has: Node → Shard → Index → Field. Node and Shard are infrastructure; we will skip them in this blog.
- Index is where you keep semi-structured data. (For structured data you would use a SQL table, where each table has a schema of columns.)
- Field is a key of the semi-structured data you keep. (Think of it as a column of a SQL table in the structured-data scenario.)
- Settings → Elasticsearch is not only a database but also a SEARCH engine; the settings define how search behaves (n-grams, analyzer, tokenizer).
- Mapping → the data type of each field, plus how your settings are applied to it.
In this blog I will show how to search for something related to characters from Harry Potter, Naruto, and One Piece.
The data looks something like this:
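The original table is not reproduced here, so here is a hypothetical sketch of its expected shape (the column names come from the index mapping used below; the rows themselves are illustrative only):

```python
# Hypothetical mock of ./mock_data/character_data.csv.
# Column names match the mapping used later; rows are illustrative only.
mock_rows = [
    {
        "th_character": "แฮร์รี่ พอตเตอร์",
        "th_character_description": "เด็กชายผู้รอดชีวิตจากคำสาปของลอร์ดโวลเดอมอร์",
        "en_character": "Harry Potter",
        "en_character_description": "The boy who survived the curse of Lord Voldemort",
    },
    {
        "th_character": "นารูโตะ",
        "th_character_description": "นินจาหนุ่มผู้ใฝ่ฝันจะเป็นโฮคาเงะ",
        "en_character": "Naruto Uzumaki",
        "en_character_description": "A young ninja who dreams of becoming Hokage",
    },
]
columns = list(mock_rows[0])  # the four text columns the index will hold
```

Each row carries the character name and description in both Thai and English; the two vector fields are added at indexing time.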
- For local Elasticsearch using Docker
from elasticsearch import Elasticsearch

client = Elasticsearch(
    hosts=["https://localhost:9200"],
    basic_auth=('USERNAME', 'PASSWORD'),
    ca_certs="./http_ca.crt",
)
- For Elastic Cloud (available on IBM Cloud!: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch)
from elasticsearch import Elasticsearch

client = Elasticsearch(
    cloud_id='ELASTIC_CLOUD_ID',
    api_key='ELASTIC_API_KEY',
    request_timeout=600,
)
- Then prepare the index settings and mapping. (Note: icu_tokenizer requires the ICU analysis plugin, installed with bin/elasticsearch-plugin install analysis-icu.)
setting_mapping = {
"settings": {
"index": {
"analysis": {
"analyzer": {
"analyzer_shingle": {
"tokenizer": "icu_tokenizer",
"filter": [
"filter_shingle"
]
}
},
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 3,
"min_shingle_size": 2,
"output_unigrams": "true"
}
}
}
}
},
"mappings": {
"properties": {
"th_character": {"type": "text"},
"th_character_description": {"type": "text", "analyzer": "analyzer_shingle"},
"en_character": {"type": "text"},
"en_character_description": {"type": "text", "analyzer": "analyzer_shingle"},
"vector_en": {
"type": "dense_vector",
"dims": 1024,
"index": True,
"similarity": "cosine"
},
"vector_th": {
"type": "dense_vector",
"dims": 768,
"similarity": "cosine"
}
}
}
}
- Then create the index
client.indices.create(index='search_character_test', body=setting_mapping)
- Then upload the data!
This step differs between Elastic Cloud and a local Docker container. For Elastic Cloud, please follow the references here:
https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch
and
https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/04-multilingual.ipynb
For the local Docker version, please continue reading this blog post.
- Then prepare the data.
import pandas as pd
from sentence_transformers import SentenceTransformer, models

def get_model(model_name, max_seq_length=768):
    word_embedding_model = models.Transformer(model_name, max_seq_length=max_seq_length)
    # Use the [CLS] token as the sentence representation
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                                   pooling_mode='cls')
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])

english_embedder = get_model('BAAI/bge-large-en-v1.5', max_seq_length=768)
thai_embedder = get_model('airesearch/wangchanberta-base-att-spm-uncased', max_seq_length=768)
data_read = pd.read_csv('./mock_data/character_data.csv')  # the table shown previously
data_set = data_read.to_dict('records')
for data in data_set:
    data['vector_en'] = english_embedder.encode(data['en_character_description']).tolist()
    data['vector_th'] = thai_embedder.encode(data['th_character_description']).tolist()
    client.index(index='search_character_test', document=data)
client.indices.refresh(index='search_character_test')
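Indexing document by document is fine for a small mock dataset; for larger corpora the bulk helper from the elasticsearch package is the idiomatic route. A sketch (the action generator is the only new piece; running bulk itself needs a live client):

```python
# Sketch: bulk indexing instead of one client.index() call per document.
# With a live client you would run:
#   from elasticsearch.helpers import bulk
#   bulk(client, generate_actions(data_set))
def generate_actions(records, index_name='search_character_test'):
    """Turn plain record dicts into actions for elasticsearch.helpers.bulk."""
    for record in records:
        # Vectors are assumed to already be attached to each record,
        # as in the encoding loop above.
        yield {"_index": index_name, "_source": record}

actions = list(generate_actions([{"en_character": "Harry Potter"}]))
```

This cuts the per-document HTTP round-trips down to a handful of batched requests.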
Now your index is ready for search (the R in RAG)!
Search options
Elasticsearch has both semantic search (vector search) and fuzzy logic (n-gram word similarity).
As you can see, the settings already apply an analyzer and tokenizer in our index, and the mapping has vector fields.
Warm-up:
# Get all data
resp = client.search(index='search_character_test', query={"match_all":{}})
for hit in resp['hits']['hits']:
print(hit['_source']['th_character'])
Start with the fuzzy logic.
th_question = 'ใครที่รอดจากคำสาปของลอร์ดโวลเดอมอร์'
en_question = 'Who survived the curse of Lord Voldemort'
fuzzy_en_payload = {
"fuzzy": {
"en_character_description": {
"value": f"{en_question}",
"fuzziness": "AUTO"
}
}
}
fuzzy_en_response = client.search(index="search_character_test", query=fuzzy_en_payload)
for hit in fuzzy_en_response['hits']['hits']:
print(hit['_score'])
print(hit['_source']['th_character'])
fuzzy_th_payload = {
"fuzzy": {
"th_character_description": {
"value": f"{th_question}",
"fuzziness": "AUTO"
}
}
}
fuzzy_th_response = client.search(index="search_character_test", query=fuzzy_th_payload)
for hit in fuzzy_th_response['hits']['hits']:
print(hit['_score'])
print(hit['_source']['th_character'])
From these fuzzy searches you get a score from '_score'.
Next let’s see the semantic search.
query_vector_en = english_embedder.encode(en_question).tolist()
semantic_query_en = {
"field": "vector_en",
"query_vector": query_vector_en,
"k":4,
"num_candidates": 20
}
semantic_resp_en = client.search(index="search_character_test", knn=semantic_query_en)
for hit in semantic_resp_en['hits']['hits']:
print(hit['_score'])
print(hit['_source']['th_character'])
query_vector_th = thai_embedder.encode(th_question).tolist()
semantic_query_th = {
"field": "vector_th",
"query_vector": query_vector_th,
"k":4,
"num_candidates": 20
}
semantic_resp_th = client.search(index="search_character_test", knn=semantic_query_th)
for hit in semantic_resp_th['hits']['hits']:
print(hit['_score'])
print(hit['_source']['th_character'])
From these vector searches you also get a score from '_score'.
RAG improvement for languages that have no re-ranker (Thai)
If your users use English, it is OK to skip these scores because you have a re-ranker model. (REF: https://medium.com/towards-generative-ai/improving-rag-retrieval-augmented-generation-answer-quality-with-re-ranker-55a19931325)
But for Thai or other non-English languages, we may need to use traditional NLP output as a score and then rank by the scores we have got (semantic plus fuzzy).
Since we have the scores, we can weight and boost them as a re-ranker (statistically or manually), based on your use-case preference.
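A minimal sketch of such a manual re-ranker (the 0.4/0.6 weights are assumptions to tune per use case): min-max normalize each score set so the fuzzy and semantic '_score' values become comparable, then rank documents by a weighted sum.

```python
# Sketch of a manual re-ranker combining fuzzy and semantic `_score`s.
# Weights are illustrative; tune them for your own corpus and language.
def normalize(scores):
    """Min-max normalize a {doc_id: score} dict into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def rerank(fuzzy_scores, semantic_scores, w_fuzzy=0.4, w_semantic=0.6):
    fuzzy_n = normalize(fuzzy_scores)
    semantic_n = normalize(semantic_scores)
    combined = {}
    # A doc missing from one result set simply contributes 0 for that signal.
    for doc in set(fuzzy_n) | set(semantic_n):
        combined[doc] = (w_fuzzy * fuzzy_n.get(doc, 0.0)
                         + w_semantic * semantic_n.get(doc, 0.0))
    return sorted(combined, key=combined.get, reverse=True)

# Scores keyed by some document field, e.g. hit['_source']['th_character']:
ranking = rerank({"Harry": 2.1, "Naruto": 1.0},
                 {"Harry": 0.92, "Luffy": 0.85})
```

The same pattern extends to more than two signals (e.g. a per-field boost) by adding more weighted, normalized score dicts to the sum.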
To learn how to use Elasticsearch with IBM Cloud, please visit: https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-elser-embeddings-elasticsearch