Illustration created with DALL-E

Building a Semantic Search: What we wish we had known when starting our journey (Part I)

Marianne Michaelis
CONTACT Research

--

This is the second part of our series about the progression of our LLM project. Missed the introduction about how we built the team? You can find it here!

Founded in 2022, the CONTACT Research department had Artificial Intelligence as one of its planned research areas from the very beginning. Initially, we engaged in smaller projects in the field of classical data analysis. While we were busy with our first experiments in this area, ChatGPT suddenly burst onto the scene and prompted us to adapt our future work plans.

The Idea: What do we want to do?

Fueled by the hype surrounding ChatGPT, the idea quickly emerged that our next project should involve some kind of Large Language Model, so that we could gain hands-on experience with a technology that was new to all of us. Nonetheless, we didn’t want a purely academic research question; instead, we wanted a use case that, given good results, could be seamlessly integrated into our commercial software product as a new feature.

In our quest to find a potential use case within our company, we quickly identified the option of improving the search function in our product. Currently, a traditional keyword search is implemented, which works well but, of course, has its known limitations, such as the inability to handle synonyms. By using a Large Language Model (LLM), we aimed to bypass these limitations in the future and allow users to search with imprecise terms and natural language queries.

The Technique: What is Semantic Search?

In short, semantic search is the idea of augmenting classical keyword search by also finding results that are semantically close to the search terms but use different wording. It encompasses different techniques such as named entity recognition, the use of ontologies and thesauri, query expansion, and semantic indexing. Semantic indexing means that the content of the source texts is embedded into a vector database using a text embedding LLM; every incoming query is embedded with the same model, and the answer is then retrieved by performing a nearest-neighbour search within the vector database.

Semantic Indexing
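
To make this more concrete, here is a minimal sketch of semantic indexing in Python. The sentence-transformers library and the model name are assumptions for illustration only, not necessarily what we used in our project, and in a real setup the snippet vectors would live in a vector database rather than in memory:

```python
# Minimal sketch of semantic indexing: embed snippets and query with the same
# model, then retrieve the best match via nearest-neighbour (cosine) search.
# Library and model name are illustrative assumptions, not our actual choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model

# 1. Embed the source snippets once; these vectors form the semantic index.
snippets = [
    "How to create a new document revision",
    "Configuring user permissions for projects",
    "Exporting a bill of materials to a spreadsheet",
]
snippet_vectors = model.encode(snippets, normalize_embeddings=True)

# 2. Embed the incoming query with the same model.
query = "how do I give a colleague access to my project?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# 3. Nearest-neighbour search: with normalised vectors, the dot product
#    equals the cosine similarity.
scores = snippet_vectors @ query_vector
best = int(np.argmax(scores))
print(f"Best match (score {scores[best]:.2f}): {snippets[best]}")
```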

Once the fundamental idea was established, we needed to specify our use case. In doing so, we first had to identify a suitable data source for our initial experiments. Several questions needed to be considered in this regard:

The Dataset: Quantity, quality, availability, and security

  1. Quantity:
    The dataset must be sufficiently large to enable meaningful results. At the same time, it should not be excessively large, as this could lead to challenges in terms of processing times or acquiring computing resources early in the project, which would hinder rapid project progress.
  2. Quality:
    To test new methods, it is essential to ensure that the dataset also offers sufficiently high input quality to make meaningful assessments of the advantages and disadvantages of the tested methods. Otherwise, there’s a risk of introducing random bias into the method comparisons.
  3. Availability:
    To expedite progress in the project, it’s advisable to choose a data source for the initial approach that is either already available or can be made easily accessible.
  4. Security:
    Certain data may be subject to confidentiality requirements, such as containing sensitive company internals or personal information. Depending on the tools and methods used to process the data, the protection of this data cannot always be guaranteed (for example, if processing occurs on external servers). To address this, one must either select or tailor the dataset beforehand to exclude critical information, or consider anonymization or encryption methods.

Taking these criteria into consideration, we ultimately decided to use the official documentation of our software as the data source. It is easily accessible to us, well-maintained, and contains only information that we distribute to all our customers in the same form, making it non-sensitive regarding confidentiality concerns.

The Plan and its Implementation: What we still needed to clarify

After choosing the data source, we needed to plan in greater detail how our search function should work. We decided that for our first try, we wanted to focus on semantic indexing. That meant we needed to break the text of our software documentation down into short snippets, embed them using a Large Language Model (LLM) for text embedding, and store them in a vector database. Incoming search queries would also be embedded using the same LLM, and then a nearest-neighbour search would be performed on the data (checking for sentence similarity). The results would be output as a list sorted by similarity score, with each entry providing the text of the snippet and a link to the documentation so that users could easily locate the original source.
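
As a rough sketch of what "breaking the documentation down into short snippets" can look like, here is a simple character-based splitter in Python. The snippet length and overlap are illustrative values only; in practice this step can also be handled by the text splitters that come with frameworks like LangChain or Haystack:

```python
# Illustrative snippet splitter: cuts a documentation page into short,
# slightly overlapping pieces that can then be embedded individually.
# The size and overlap values are placeholders, not the ones we used.
def split_into_snippets(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    snippets = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        snippets.append(text[start:end].strip())
        if end == len(text):
            break
        # small overlap so sentences cut at a boundary are not lost entirely
        start = end - overlap
    return snippets

# Each snippet is later stored together with a link to its documentation page,
# so search results can point the user back to the original source.
```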

Once we had this great plan for our search function, we thought we just needed to “quickly implement it”… Spoiler alert: We were not entirely correct 😉 In practice, there were many aspects we hadn’t considered yet, and we had to address them along the way:

The Infrastructure: Pipeline, vector database, computational capacity, and database structure

  1. Pipeline:
    You can either code the processing pipeline for your LLM project entirely on your own or make use of frameworks such as Haystack or LangChain. We tested both and were happier with LangChain, since we found it easier to get the pipeline working; nonetheless, development in this area is rapid, so we can only advise checking which option appears most suitable at the start of your project.
  2. Vector Database:
    There is currently a wide range of vector databases available, such as Milvus, Qdrant, Vespa, Pinecone and Weaviate, as well as database systems that have been extended to support vector search, like Elasticsearch and PostgreSQL. In our project, we were happy using Qdrant (it is very user-friendly and was easy to set up; see the sketch after this list), which we initially used as a cloud solution and later as a local installation. Most providers offer both options, and the main factor in your decision should be whether you have your own storage and computing resources or prefer to purchase them externally.
  3. Computational Capacity:
    For calculating embeddings for sufficiently large amounts of text, normal laptops or desktop computers are generally not suitable. Therefore, you may need to rely on other resources, such as existing servers, or use one of the well-known cloud providers.
  4. Database Structure:
    In essence, you need a server with good computational power for performing the embeddings and another server for hosting the data in a database. These can be the same machine or different ones. For example, we were satisfied with performing the embedding calculations in-house while storing the embedded data externally with a cloud provider. This allowed us to maintain control over the data content (the vectors that were stored in the cloud didn’t contain the plain text) without requiring too much local storage capacity. Additionally, building the database involved planning the data structures concretely to set up the appropriate tables with all the necessary fields. We handled this task later in the project and had to make adjustments repeatedly; it would be advisable to think the architecture through thoroughly beforehand to avoid such issues.
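
To illustrate the split setup described above, here is a rough sketch using Qdrant: the embeddings are assumed to be computed locally, and the (cloud-hosted) collection only receives the vectors plus a small payload consisting of an ID and the documentation link, never the plain text. The URL, API key, collection name, vector size and the tiny toy vectors are all placeholders:

```python
# Sketch: vectors computed locally, stored in a (cloud) Qdrant instance whose
# payload holds only an ID and a documentation link, not the plain text.
# All names, URLs and the tiny 4-dimensional vectors are placeholders.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="https://your-qdrant-instance", api_key="<api-key>")

client.recreate_collection(
    collection_name="docs_snippets",
    vectors_config=models.VectorParams(size=4, distance=models.Distance.COSINE),
)

# In practice the vector comes from the locally run embedding model and has
# the model's real dimensionality (typically several hundred dimensions).
snippet_vector = [0.1, 0.2, 0.3, 0.4]
client.upsert(
    collection_name="docs_snippets",
    points=[
        models.PointStruct(
            id=42,
            vector=snippet_vector,
            payload={"snippet_id": 42, "doc_link": "https://docs.example.com/page#42"},
        )
    ],
)

# The query vector is embedded locally with the same model; Qdrant returns
# the nearest neighbours already sorted by similarity score.
query_vector = [0.1, 0.2, 0.25, 0.4]
hits = client.search(collection_name="docs_snippets", query_vector=query_vector, limit=5)
for hit in hits:
    print(hit.score, hit.payload["doc_link"])
```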

Even when the setup of the infrastructure is completed, there is still plenty left to discuss and decide in an LLM project: What preprocessing does the dataset need? Which text embedding model and which distance function should be used? Do we want to test several of them, and if so, how can we decide which one performed best? How should the results be made accessible to users? Still so much to think about, so keep looking forward to Part III of our series!

About CONTACT Research. CONTACT Research is a dynamic research group dedicated to collaborating with innovative minds from the fields of science and industry. Our primary mission is to develop cutting-edge solutions for the engineering and manufacturing challenges of the future. We undertake projects that encompass applied research, as well as technology and method innovation. An independent corporate unit within the CONTACT Software Group, we foster an environment where innovation thrives.

--

Marianne Michaelis
CONTACT Research

Researcher @CONTACT Software, curious about Data Science, AI and Philosophy