How to combine vector search with filtering in Elasticsearch

Fatihsati
5 min read, Aug 13, 2023

Large language models (LLMs) are evolving every day, and this progress is driving the expansion of semantic search. LLMs excel at analyzing text and revealing semantic similarities. Search engines benefit as well, because semantic search can give users more satisfying results.

Although large language models can capture semantically close results, implementing filters within search outcomes is crucial for enhancing the user experience. For example, incorporating filters based on dates or categories can significantly contribute to a more satisfying search experience. So, how can we effectively combine semantic search with filtering?

Let’s begin by connecting to Elasticsearch and running some basic search queries:

from elasticsearch import Elasticsearch
import config as cfg

client = Elasticsearch(
    'https://localhost:9200',
    ssl_assert_fingerprint=cfg.ES_FINGERPRINT,
    basic_auth=('elastic', cfg.ES_PASSWORD)
)

I read the connection details from a config file. Elasticsearch provides these details automatically when it is launched for the first time.

[
    {
        "title": "Data Structures and Algorithms",
        "date": "2023-08-02",
        "author": "Emily Johnson"
    },
    {
        "title": "Artificial Intelligence Trends",
        "date": "2023-08-01",
        "author": "William Smith"
    },
    ...
]

The dataset I will use throughout this post was generated by ChatGPT and follows the format shown above.

Let’s read our data using this JSON file and create an Elasticsearch index according to this format, then add the data into it.

book_mappings = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "author": {"type": "text"},
            "date": {"type": "date"}
        }
    }
}

client.indices.create(index="book_index", body=book_mappings)

import json

with open('data.json', 'r') as f:
    data = json.load(f)

# Index each record, then refresh so the documents are searchable right away
for each in data:
    client.index(index='book_index', document=each)
client.indices.refresh(index='book_index')

You may find the code and the dataset here.

In the dataset we’ve created, there are 3 fields, two of which are formatted as text, and one as a date. Afterwards, we utilize this mapping to create an index, naming it “book_index”. Since our data and index are in the same format, there is no need for any additional processing at this stage.

Let’s start with a query that will retrieve all documents within the index:

match_all query and its results
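A minimal sketch of such a query could look like the following. It assumes the `client` object created earlier and the 8.x Python client, where `search()` accepts a `query` keyword. The helper name `fetch_all` and the `size` default are my own choices for illustration.

```python
def fetch_all(client, index="book_index", size=10):
    """Return the _source of every hit from a match_all query (up to `size`)."""
    # match_all has no conditions, so every document in the index matches.
    response = client.search(index=index, query={"match_all": {}}, size=size)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

Calling `fetch_all(client)` should return a list of dictionaries with the "title", "date" and "author" fields of each stored document.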

In order to apply filtering to the documents within the index, we need to modify the “query” parameter. To search for words within the text, we will use the “match” keyword:

Filtering documents by matching a keyword

We listed the documents within the index that have the word “Data” in their “title” field.
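As a rough sketch, the keyword search could be written like this. The helper names `title_match_query` and `search_titles` are hypothetical, and `client` is assumed to be the connection created earlier.

```python
def title_match_query(word):
    """Build a full-text `match` query against the `title` field."""
    return {"match": {"title": word}}


def search_titles(client, word, index="book_index"):
    """Run the match query and return (score, title) pairs."""
    response = client.search(index=index, query=title_match_query(word))
    return [
        (hit["_score"], hit["_source"]["title"])
        for hit in response["hits"]["hits"]
    ]
```

Because "title" is mapped as "text", it is analyzed at index time, so with the default analyzer `search_titles(client, "Data")` should also match titles containing "data" in lowercase.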

If you want to apply filtering across multiple fields, you can do so with the “bool” query. Conditions that should restrict the results without affecting the relevance scores go inside its “filter” section.

Elasticsearch search query with bool operation
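Here is one way such a “bool” query might be put together. The specific clauses (a scored "match" on the title plus an unscored date range) are only an illustration, and the function name `build_bool_query` is hypothetical.

```python
def build_bool_query(title_word, published_on_or_after):
    """Combine a scored text match with a non-scoring date filter."""
    return {
        "bool": {
            # "must" clauses have to match and contribute to the score.
            "must": [{"match": {"title": title_word}}],
            # "filter" clauses have to match but do not change the score.
            "filter": [{"range": {"date": {"gte": published_on_or_after}}}],
        }
    }
```

The result can be passed straight to the client, for example `client.search(index="book_index", query=build_bool_query("Data", "2023-08-01"))`.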

For more information on Elasticsearch queries you may check here.

Now, let’s create the same index with document vectors included. For this post, I’ll be using the Sentence-Transformers library and the ‘all-mpnet-base-v2’ model. You are not tied to this model and may choose any model you want. You can explore more models here.

vector_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "author": {"type": "text"},
            "date": {"type": "date"},
            "vector": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "dot_product"
            }
        }
    }
}

client.indices.create(index='vector_index', body=vector_mapping)

While creating the “vector_index”, we add an extra field of type “dense_vector” and specify the parameters for vector search. The “dims” parameter is the dimensionality of the vectors the model produces (768 for ‘all-mpnet-base-v2’). “similarity” determines how vector similarity is measured. You can explore the different “similarity” values here.
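One detail worth keeping in mind: according to the Elasticsearch documentation, “dot_product” expects every vector to have unit length. As far as I know, ‘all-mpnet-base-v2’ already produces normalized embeddings, but if you switch to a model that does not, you could normalize the vectors yourself (or use “cosine” instead). The helper below is a small pure-Python sketch of that step; its name is my own. Sentence-Transformers also offers `encode(..., normalize_embeddings=True)` for the same purpose.

```python
import math


def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]
```

For example, `l2_normalize([3.0, 4.0])` gives a vector of length 1 that points in the same direction as the input.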

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')

for each in data:
    each['vector'] = model.encode(each['title'])
    client.index(index='vector_index', document=each)
client.indices.refresh()

Let’s load the model using the Sentence-Transformers library and extract vectors from the “title” sections of the dataset. We will then add these vectors to each data entry and proceed to add this data to the “vector_index” index.
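`model.encode` returns a NumPy array. If you want the documents to stay plain JSON-friendly Python objects (for example to dump them to a file or to use a different client), one option is to convert the vector to a list before indexing. The helper below is a sketch under that assumption; `with_vector` is a hypothetical name, and `encode` stands for any function that maps a string to a vector, such as `model.encode`.

```python
def with_vector(doc, encode):
    """Return a copy of `doc` with a `vector` field built from its title."""
    vector = encode(doc["title"])
    # NumPy arrays expose tolist(); fall back to list() for other iterables.
    as_list = vector.tolist() if hasattr(vector, "tolist") else list(vector)
    # Build a new dict so the original record is left untouched.
    return {**doc, "vector": as_list}
```

It could then be used as `client.index(index='vector_index', document=with_vector(each, model.encode))` inside the loop.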

In order to perform vector search within Elasticsearch, we first need a query text and then its corresponding vector representation.

Important Note: The model used to obtain the query vector should be the same as the model used when indexing the documents; otherwise, achieving accurate results would be quite challenging.

To perform vector search, the Elasticsearch.search() function accepts a “knn” parameter. An example “knn” query is shown in the image below. The “k” value indicates how many results you want to retrieve, while “num_candidates” specifies how many candidate documents are considered on each shard before the best “k” are selected. “query_vector” is the vector representation of the query text (in our case “HTML and CSS programming”). You can find detailed information about knn query parameters here.

Vector search for the “HTML and CSS programming” query

The results returned for the example query are visible in the image above. Even though none of the returned titles contain exactly the same words as the query, they are semantically close to it.
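A vector search along these lines could be sketched as follows. The helper names are hypothetical, the default values for “k” and “num_candidates” are my own assumptions rather than the ones used in the screenshot, and `encode` stands for the same function that was used at indexing time (here `model.encode`).

```python
def to_list(vector):
    """Convert a NumPy array (or any iterable of floats) to a plain list."""
    return vector.tolist() if hasattr(vector, "tolist") else list(vector)


def knn_clause(query_vector, k=5, num_candidates=50, field="vector"):
    """Build the dictionary passed as the `knn` argument of search()."""
    # Elasticsearch rejects requests where num_candidates is smaller than k.
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "field": field,
        "query_vector": to_list(query_vector),
        "k": k,
        "num_candidates": num_candidates,
    }


def semantic_search(client, encode, text, index="vector_index", k=5, num_candidates=50):
    """Encode the query text and return (score, title) pairs of the nearest documents."""
    knn = knn_clause(encode(text), k=k, num_candidates=num_candidates)
    # Only request the readable fields; the stored vector itself is not needed.
    response = client.search(index=index, knn=knn, source=["title", "author", "date"])
    return [
        (hit["_score"], hit["_source"]["title"])
        for hit in response["hits"]["hits"]
    ]
```

With the objects from this post, the call would be `semantic_search(client, model.encode, "HTML and CSS programming")`.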

So, if we also want to use these semantic search results in conjunction with filtering, how should we prepare the “knn” query?

Filtering vector search

Each filter we apply is provided as a “filter” within the “knn” parameter. You can add as many filters as you want here and combine the results based on these filters. In the example above, both date and keyword filters have been added together, aiming to list documents that are semantically close and contain the word “Development” while having a date later than July 1, 2023.
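The filtered query described here might be built roughly like this. I wrap the individual conditions in a “bool” query so that any number of them can be combined; the function name and the default “k” and “num_candidates” values are assumptions made for the sketch.

```python
def filtered_knn_clause(query_vector, title_word, after_date, k=5, num_candidates=50):
    """Build a `knn` clause restricted to documents that pass every filter."""
    # Accept NumPy arrays as well as plain sequences.
    vector = query_vector.tolist() if hasattr(query_vector, "tolist") else list(query_vector)
    return {
        "field": "vector",
        "query_vector": vector,
        "k": k,
        "num_candidates": num_candidates,
        "filter": {
            "bool": {
                # Every clause listed here must match; none of them affects the score.
                "filter": [
                    {"match": {"title": title_word}},
                    {"range": {"date": {"gt": after_date}}},
                ]
            }
        },
    }
```

For the example in this post, the call would look like `client.search(index="vector_index", knn=filtered_knn_clause(query_vector, "Development", "2023-07-01"))`, where `query_vector` is the encoded query text.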

Important Note: Elasticsearch applies the “filter” of a “knn” search during the approximate vector search rather than afterwards, so it tries to return “k” documents that all satisfy the filter. It can still return fewer than “k” results when not enough documents match. In the image above, even though the “k” value is set to 5, the query has returned 3 documents, because only 3 documents in the example dataset meet the specified criteria.

If you found this article helpful or have any questions, feel free to contact me on LinkedIn.

Fatihsati

Interested in Machine Learning, particularly in Natural Language Processing. I love learning and sharing my knowledge. Contact at fatihsati@gmail.com.