Ultimate Semantic Search — Part 1 — Haystack Framework

dmitri yanno mahayana
9 min read · Mar 10, 2024


Haystack Framework

Haystack emerges as a cutting-edge open-source framework designed to seamlessly integrate large language models (LLMs) into production-ready applications, revolutionizing how businesses leverage AI. It offers unparalleled flexibility, allowing developers to employ the latest LLMs from industry giants like OpenAI, as well as open-source and other pre-trained models.

Haystack consolidates essential tooling in one comprehensive package, including preprocessing, pipelines, agents and tools, prompts, evaluation, and finetuning capabilities. Furthermore, it supports a variety of databases, such as Elasticsearch, OpenSearch, Weaviate, Pinecone, Qdrant, Milvus, and more, providing the freedom to choose the most suitable database for any given application. Designed to scale, Haystack’s robust retrieval architecture can handle millions of documents, making it a go-to solution for developers aiming to build scalable, efficient, and powerful LLM-based applications.

Semantic Search

Semantic search is a search engine technology that interprets the meaning of words and phrases. A semantic search returns content that matches the meaning of a query, as opposed to content that literally matches the words in the query.

Semantic search is a set of search engine capabilities that includes understanding the searcher's intent and the context of their query.


How does it work?

Semantic search is powered by vector search, which enables it to deliver and rank content based on contextual relevance and intent. Vector search encodes the details of searchable information into vectors (numerical representations of related terms or items) and then compares those vectors to determine which are most similar.

A vector search-enabled semantic search produces results by working at both ends of the query pipeline simultaneously:

  1. When a query is launched, the search engine transforms the query into embeddings: numerical vector representations of the query and its context.
  2. The k-nearest neighbor (kNN) algorithm then matches the vectors of existing documents (semantic search concerns text) against the query vector (see the sketch after this list).
  3. The search engine then generates results and ranks them by conceptual relevance.
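
To make the vector-matching step concrete, here is a minimal sketch in plain numpy (the 3-dimensional embeddings are made up for illustration; real models output hundreds of dimensions) of how cosine similarity ranks document vectors against a query vector:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional embeddings; real models output 384-1024 dimensions
query_vec = np.array([0.9, 0.1, 0.3])
doc_vecs = {
    "doc_about_wolves": np.array([0.8, 0.2, 0.4]),
    "doc_about_cooking": np.array([0.1, 0.9, 0.2]),
}

# Rank documents by similarity to the query: the essence of kNN search
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
for doc_id, vec in ranked:
    print(doc_id, round(cosine_similarity(query_vec, vec), 3))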

Haystack Semantic Search Modules

There are 4 important modules that we should use to build a semantic search with Haystack:

1. DocumentStore
You can think of the DocumentStore as a database that stores your texts and metadata and provides them to the Retriever at query time. Learn how to choose the best DocumentStore for your use case and how to use it in a pipeline. Many DocumentStores are available at the moment, e.g. Elasticsearch, Milvus, MongoDB, OpenSearch, Pinecone, Qdrant, SQL, etc.
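
As a quick illustration, here is a minimal sketch of creating a DocumentStore and writing documents to it. I use the in-memory store so it runs without any external service; the document texts are made up:

from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(embedding_dim=768)
document_store.write_documents([
    {"content": "Garurumon is a Champion-level Digimon.", "meta": {"source": "demo"}},
    {"content": "Agumon digivolves into Greymon.", "meta": {"source": "demo"}},
])
print(document_store.get_document_count())  # prints 2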

2. Retriever
The Retriever performs document retrieval by sweeping through a DocumentStore and returning a set of candidate Documents that are relevant to the query. See what Retrievers are available and how to choose the best one for your use case. In a query pipeline, the Retriever takes a query as input and checks it against the Documents contained in the DocumentStore. It scores each Document for its relevance to the query and returns the top candidates.

When used in combination with a Reader, the Retriever can quickly sift out irrelevant Documents, saving the Reader from doing more work than it needs to and speeding up the querying process.

In indexing pipelines, vector-based Retrievers take Documents as input, and for each Document, they calculate its embedding. This embedding is stored as part of the Document in the DocumentStore. If you’re using a keyword-based Retriever in your indexing pipeline, no embeddings are calculated. The Retriever creates a keyword-based index that it uses for quickly looking Documents up.
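
To illustrate the difference, here is a sketch of a keyword-based Retriever next to a vector-based one against the same Elasticsearch store (this assumes a running Elasticsearch instance with documents already written; the model name is a common sentence-transformers checkpoint):

from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, EmbeddingRetriever

document_store = ElasticsearchDocumentStore(host="localhost", port=9200)

# Keyword-based: no embeddings, relies on the store's inverted index
bm25_retriever = BM25Retriever(document_store=document_store)

# Vector-based: embeds Documents at indexing time and the query at search time
embedding_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-mpnet-base-v2",
)
document_store.update_embeddings(embedding_retriever)  # only vector retrievers need this

candidates = embedding_retriever.retrieve(query="wolf digimon", top_k=5)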

3. Ranker
Rankers reorder documents based on a condition such as relevance or recency. The improvement that the Ranker brings comes at the cost of some additional computation time. Haystack supports various ranking models such as transformer models and Cohere models.
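
A Ranker can also be tried standalone before wiring it into a pipeline; here is a minimal sketch using the same cross-encoder model that appears later in this article (the example documents are made up):

from haystack import Document
from haystack.nodes import SentenceTransformersRanker

ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
docs = [
    Document(content="This article is about cooking pasta."),
    Document(content="Garurumon is a wolf-like Champion Digimon."),
]
# The cross-encoder scores each (query, document) pair and reorders the list
reranked = ranker.predict(query="wolf digimon", documents=docs, top_k=2)
for doc in reranked:
    print(doc.content, doc.score)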

4. Reader
The Reader takes a question and a set of Documents as input and returns an Answer by selecting a text span within the Documents. Readers use models to perform QA. Learn about the available Reader classes and the recommended models.
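
Here is a minimal sketch of a Reader extracting an answer span on its own, using the same roberta-base-squad2 model that appears later (the document text is made up; set use_gpu=True if you have a GPU):

from haystack import Document
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
result = reader.predict(
    query="What level is Garurumon?",
    documents=[Document(content="Garurumon is a Champion-level Digimon.")],
    top_k=1,
)
# Each Answer carries the extracted span, a confidence score, and its context
for answer in result["answers"]:
    print(answer.answer, answer.score)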

Pros
- Built on the latest transformer-based language models.
- Strong in their grasp of semantics.
- Sensitive to syntactic structure.
- State-of-the-art in QA tasks like SQuAD and Natural Questions.

Cons
- Requires a GPU to run quickly.

Use Case

We will create a semantic search to query a Digimon dataset. This process requires a stable internet connection because we need to download torch and the Haystack inference models (retriever, ranker, reader, etc.). If your computer has a GPU (NVIDIA, AMD, etc.), you can enable the configuration to run on the GPU, which will increase inference performance during indexing and searching.
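
Before enabling GPU options, a quick sketch to check whether PyTorch can actually see a GPU (Haystack nodes such as FARMReader accept a use_gpu flag):

import torch

# True only if a compatible GPU and matching torch build are available
use_gpu = torch.cuda.is_available()
print("GPU available:", use_gpu)
# Pass the flag to nodes that support it, e.g. FARMReader(..., use_gpu=use_gpu)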

We split the process into 2 phases:
1. Creating Index
This process creates a DocumentStore by embedding the data into vectors and storing them in an Elasticsearch index. We will use Elasticsearch because it is compatible with many Retriever models and is easy to set up using Docker. There is also a FAISS DocumentStore, which requires no Docker installation, but with Elasticsearch it is easier to monitor the results (the generated vectors) through a Kibana dashboard connected to Elasticsearch.

2. Performing Search/Query
The process reads the document store and creates a pipeline that combines all the modules (Retriever-Ranker-Reader). After that, we can run the pipeline to get the top K results and show them in the console.

Development

This section shows how to develop the semantic search using Haystack. Please note that I am using a virtual environment with Python 3.9.13.

Package

This is the list of Python packages that we need to install. My suggestion is to create a requirements.txt file and put all the packages in it. After that, we can run "pip install -r requirements.txt" to install all of them in one go.

farm-haystack==1.24.1 -f https://download.pytorch.org/whl/torch_stable.html
farm-haystack[faiss]==1.24.1
farm-haystack[elasticsearch]==1.24.1
sentence-transformers==2.3.1
pandas==2.2.0
pyarrow==15.0.0
python-dotenv==1.0.1

Initialization

There is no sophisticated initialization in this project; we just import the packages, define the env file, and load the variables.

from haystack import Pipeline
from haystack.document_stores import FAISSDocumentStore
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.nodes import DensePassageRetriever
from haystack.nodes import SentenceTransformersRanker
from haystack.nodes import FARMReader
from dotenv import load_dotenv
import pandas as pd
import os
import ast

# Define env variable
load_dotenv(".env")

# Define constants variable
FAISS_DB_PATH = os.getenv('faiss_db_path')
FAISS_INDEX_PATH = os.getenv('faiss_index_path')
FAISS_CONFIG_PATH = os.getenv('faiss_config_path')
ES_HOST = os.getenv('es_host')
ES_PORT = os.getenv('es_port')
ES_SCHEME = os.getenv('es_scheme')
ES_VERIFY_CERTS = os.getenv('es_verify_certs')
ES_CA_CERTS = os.getenv('es_ca_certs')
ES_USERNAME = os.getenv('es_username')
ES_PASSWORD = os.getenv('es_password')
ES_EMBEDDING_DIM = os.getenv('es_embedding_dim')
ES_PREFIX_INDEX = os.getenv('es_prefix_index')
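
For reference, here is a sample .env file. All values below are placeholder assumptions; adjust them to your own Elasticsearch and FAISS setup (the 768 dimension matches the DPR and mpnet models used in this article):

faiss_db_path=sqlite:///faiss_document_store.db
faiss_index_path=faiss_index.faiss
faiss_config_path=faiss_index.json
es_host=localhost
es_port=9200
es_scheme=https
es_verify_certs=False
es_ca_certs=./certs/http_ca.crt
es_username=elastic
es_password=changeme
es_embedding_dim=768
es_prefix_index=semantic_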

Create Index

As I mentioned in the DocumentStore section, we will use Elasticsearch to store the embedding data. We can use a username/password or an API key/secret to access Elasticsearch from our code.

Haystack provides many Retrievers that we can use for the embedding process (converting data into vectors). But not all Retrievers are able to send their embedding results to every DocumentStore; for example, the SQL DocumentStore can't be used as a target to store embeddings. So the best practice is always to check the DocumentStore compatibility table and compare the results from different models.

You can find the Retriever documentation here and a list of semantic search pretrained models here. You can use the all-mpnet-base-v2 model as a starting point for semantic embeddings. But after I did some testing, facebook/dpr from Hugging Face gives better semantic search results compared to mpnet.
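
If you want to try the mpnet baseline anyway, the DPR Retriever defined in create_index below can be swapped for an EmbeddingRetriever. A sketch of the drop-in replacement (document_store here is the ElasticsearchDocumentStore defined inside the function; both models output 768-dimensional vectors, so es_embedding_dim stays the same):

from haystack.nodes import EmbeddingRetriever

# Drop-in replacement for the DensePassageRetriever block in create_index
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-mpnet-base-v2",
)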

def create_index(documents, type):
    # Define ES document store
    document_store = ElasticsearchDocumentStore(host=ES_HOST,
                                                port=int(ES_PORT),
                                                scheme=ES_SCHEME,
                                                verify_certs=ast.literal_eval(ES_VERIFY_CERTS),
                                                ca_certs=ES_CA_CERTS,
                                                username=ES_USERNAME,
                                                password=ES_PASSWORD,
                                                embedding_dim=int(ES_EMBEDDING_DIM),
                                                index=ES_PREFIX_INDEX + type + ES_EMBEDDING_DIM)

    # Define Retriever
    retriever = DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
        passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
    )

    # Write documents, then compute and store their embeddings
    document_store.write_documents(documents)
    document_store.update_embeddings(retriever)

For dense neural network-based retrievers like Dense Passage Retrieval or Embedding Retrieval, indexing involves computing the Document embeddings which will be compared against the Query embedding.

The storing of the text is handled by write_documents() and the computation of the embeddings is handled by update_embeddings(). This step is computationally intensive since it will engage the transformer-based encoders. Having GPU acceleration will significantly speed this up.
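
To sanity-check the indexing step, you can add a couple of lines at the end of create_index; a small sketch (both counts should match once update_embeddings() has finished):

# Verify the index: every stored Document should also have an embedding
print("Documents stored:", document_store.get_document_count())
print("Embeddings stored:", document_store.get_embedding_count())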

Perform Query

We split the process into 6 parts:
1. Load the DocumentStore from Elasticsearch that was indexed by the previous function
2. Use the Retriever to compute similarity scores between the embedded query and the DocumentStore
3. Use the Ranker to improve the results
4. Use the Reader to take a question as input and extract answers
5. Assemble Retriever-Ranker-Reader into a Pipeline and run the query
6. Show the query results from the Pipeline

def perform_query(query_string, N, type):
    # Define ES document store
    document_store = ElasticsearchDocumentStore(host=ES_HOST,
                                                port=int(ES_PORT),
                                                scheme=ES_SCHEME,
                                                verify_certs=ast.literal_eval(ES_VERIFY_CERTS),
                                                ca_certs=ES_CA_CERTS,
                                                username=ES_USERNAME,
                                                password=ES_PASSWORD,
                                                embedding_dim=int(ES_EMBEDDING_DIM),
                                                index=ES_PREFIX_INDEX + type + ES_EMBEDDING_DIM)

    # Define Retriever (must match the model used at indexing time)
    retriever = DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
        passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
    )

    # Define Ranker
    ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")

    # Define Reader
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

    # Create Pipeline
    query_pipeline = Pipeline()
    query_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
    query_pipeline.add_node(component=ranker, name="Ranker", inputs=["Retriever"])
    query_pipeline.add_node(component=reader, name="Reader", inputs=["Ranker"])

    # Perform Query
    top_k_retriever = N
    top_k_reader = N

    results = query_pipeline.run(query=query_string,
                                 params={
                                     "Retriever": {"top_k": top_k_retriever},
                                     "Reader": {"top_k": top_k_reader}
                                 })
    print("Query:", query_string)
    for row in results['documents']:
        print(f"ID: {row.id}, Content: {row.content[:100]}, Score: {row.score}")
    print("Done...")

We can't change the Retriever model at query time because the index was built with a specific Retriever model during the indexing process. We can, however, change the Ranker and Reader models to improve the results from the Retriever. We can also limit the number of results using the top_k filter under the pipeline's query params.
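
Because the pipeline ends with a Reader, results also contains extracted answers, not just documents. A small sketch of extending the printing loop in perform_query with the Answer fields:

# The Reader adds an 'answers' key holding the extracted spans
for answer in results['answers']:
    print(f"Answer: {answer.answer}, Score: {answer.score}")
    print(f"Context: {answer.context}")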

Main

Once we have prepared the indexing and query functions, we can move to the main part to load the dataset and run the semantic search. The input is the Digimon dataset, and we combine all the columns into a single column named "content". After that, we convert the dataframe into dictionaries consisting of an id (Number) and content.

if __name__ == "__main__":
    # Load Data
    df = pd.read_csv("./Data/DigiDB_digimonlist.csv")
    df['content'] = df.apply(lambda row: ', '.join([f"{index} {value}" for index, value in row.items()]), axis=1)
    documents = df[['Number', 'content']].to_dict(orient='records')

    # Define Index Name
    index_name = 'digimon'

    # Perform Indexing
    create_index(documents, index_name)

    # Perform Searching
    perform_query('Garurumon', 5, index_name)

It is more interesting if we connect Haystack with FastAPI and perform more complex tasks such as reading data from PDFs or running chained messages powered by OpenAI. I will show more interesting cases in my other Medium articles.

Result

This is the result when we run the main program (console view):

Console Result

If you are curious what the vectors from the embedding process look like, you can go to Kibana and open the index pattern there. It will look like this:

Kibana — Elastic Search Vector Result

Summary

Haystack is one of the best semantic search frameworks that we can use, and it is production-ready, which means we can plug in any of its modules that already have pre-trained models. If we need to do some fine-tuning, there is documentation here that we can use as a sample to optimize the Haystack modules. Haystack has many integrations with big LLM providers (OpenAI, Cohere) and open-source models (mpnet, Facebook DPR, etc.). Although there is room to improve the framework, such as compatibility, it is still good and ready for production use because of its modularization and performance.
