Build a search engine in 5 minutes using Qdrant

Armaghan Shakir
3 min read · Feb 12, 2024

Qdrant is an open-source vector search engine designed to efficiently handle and search through large collections of high-dimensional data. It is particularly well-suited for tasks involving similarity search, such as recommendation systems, image or text retrieval, and clustering.

In this step-by-step guide, I will show you how to build a search engine with Qdrant in five minutes.

Dataset

As an example, I use the Quora Question Pairs dataset from Kaggle. It is a popular dataset for natural language processing (NLP) tasks, particularly semantic similarity and duplicate question detection. Here, we are going to use it to build a search engine.

Setting Up

  1. Join the Quora Question Pairs Competition on Kaggle
  2. Download the file train.csv.zip
  3. Unzip the downloaded file.
  4. Save the path to the dataset in the DATA_PATH variable.

DATA_PATH = "/kaggle/working/train.csv" # path to your train.csv file

Initialize Constants

# Name of the Qdrant collection that stores the vectors
QD_COLLECTION_NAME = "collection_name"

# Sample size; the complete dataset is large and takes a long time to process
N = 30_000

Loading Dataset

Now load the dataset with pandas:

import pandas as pd

df = pd.read_csv(DATA_PATH)

print("Shape of DataFrame:", df.shape)
print("First 10 rows:")
df.head(10)
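
If memory or load time is a concern, one option is to read only the two question columns and, while experimenting, only the first few rows. The sketch below uses a tiny inline CSV as a stand-in for train.csv so that it runs anywhere; the sample rows are made up, and small_df is a name chosen only for this illustration.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv (made-up rows, same column names as the Kaggle file).
sample_csv = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    "0,1,2,What is Qdrant?,What does Qdrant do?,0\n"
    "1,3,4,How do I learn Python?,How can I learn Python?,1\n"
    "2,5,6,What is a vector?,What is an embedding?,0\n"
)

# usecols keeps only the two text columns; nrows caps how many rows are parsed.
small_df = pd.read_csv(sample_csv, usecols=["question1", "question2"], nrows=2)

print(small_df.shape)
```

With the real file, the same two arguments can be passed to pd.read_csv(DATA_PATH, ...).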

Questions

The next step is to extract the questions from the DataFrame, remove duplicates, and take a sample so that we can see results on part of the data.

# extract the questions from df
questions = pd.concat([df['question1'], df['question2']], axis=0)

# remove missing values and all the duplicate questions
questions = questions.dropna().drop_duplicates()

# print total number of questions
print("Total Questions:", len(questions))

# sample questions from complete data to avoid long processing
questions = questions.sample(N)

# print first 10 questions
print("First 10 Questions:")
questions.iloc[:10]
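
To see what this extraction step does, here is a minimal sketch on a made-up three-row DataFrame. It makes two additions that are my own assumptions rather than part of the steps above: it drops missing values (the raw file may contain a few empty questions, and the embedding step expects strings), and it passes random_state to sample so the same subset is drawn on every run. The names toy, toy_questions, sample_a, and sample_b are illustrative only.

```python
import pandas as pd

# Made-up data: one question appears in both columns, and one cell is missing.
toy = pd.DataFrame({
    "question1": ["What is Qdrant?", "How do I learn Python?", None],
    "question2": ["How do I learn Python?", "What is a vector?", "What is Qdrant?"],
})

toy_questions = (
    pd.concat([toy["question1"], toy["question2"]], axis=0)
    .dropna()                 # remove missing questions
    .drop_duplicates()        # keep the first occurrence of each question
    .reset_index(drop=True)   # tidy the index after concatenation
)
print(len(toy_questions))

# A fixed random_state makes the sample reproducible.
sample_a = toy_questions.sample(2, random_state=42)
sample_b = toy_questions.sample(2, random_state=42)
print(sample_a.equals(sample_b))
```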

Qdrant

First, install the Qdrant client with FastEmbed. FastEmbed is a fast, accurate, lightweight Python library for generating state-of-the-art embeddings. The Qdrant client uses it to upsert text directly into the vector database and takes care of the embeddings for you.

!pip install "qdrant-client[fastembed]"
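
To make the idea of an embedding concrete, here is a conceptual sketch of similarity search in plain Python. The three-dimensional vectors are made up by hand; a real embedding model produces vectors with hundreds of dimensions, and Qdrant's own indexing and distance settings may differ from this toy ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (not produced by any model).
toy_vectors = {
    "What should I eat for breakfast?": [0.9, 0.1, 0.0],
    "Why is the sky blue?": [0.0, 0.2, 0.9],
    "What is a healthy morning meal?": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of a breakfast-related query

# Rank the stored texts from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
for text in ranked:
    print(text)
```

The two food-related questions rank above the unrelated one because their vectors point in nearly the same direction as the query vector.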

Then add the documents to a Qdrant collection:

from qdrant_client import QdrantClient

client = QdrantClient(":memory:")

client.add(
    collection_name=QD_COLLECTION_NAME,
    documents=questions,
)

print("Completed")

I have used the :memory: option here, which stores all vectors in memory and is not recommended for production. For production, you should run a Qdrant server as a Docker container or connect to a Qdrant Cloud cluster; in that case, refer to the Qdrant documentation.
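
As a rough sketch of how the switch from in-memory mode to a server could look, the helper below picks the QdrantClient arguments from environment variables. The variable names QDRANT_URL and QDRANT_API_KEY and both function names are my own choices for this illustration, not something the library requires. A local Docker container is typically reached at http://localhost:6333.

```python
import os


def qdrant_connection_kwargs(env=os.environ):
    """Build keyword arguments for QdrantClient.

    QDRANT_URL and QDRANT_API_KEY are hypothetical variable names for this sketch.
    """
    url = env.get("QDRANT_URL")
    if not url:
        # No server configured: fall back to the in-memory mode used in this guide.
        return {"location": ":memory:"}
    kwargs = {"url": url}
    api_key = env.get("QDRANT_API_KEY")
    if api_key:
        # An API key is usually only needed for a hosted cluster.
        kwargs["api_key"] = api_key
    return kwargs


def connect(env=os.environ):
    """Create a client from the settings above (imports qdrant_client lazily)."""
    from qdrant_client import QdrantClient

    return QdrantClient(**qdrant_connection_kwargs(env))


print(qdrant_connection_kwargs({"QDRANT_URL": "http://localhost:6333"}))
```

Calling connect() in place of QdrantClient(":memory:") leaves the rest of the code unchanged.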

Now create a search function that runs the query and prints the results:

def search(query):
    # embed the query text and fetch the 5 most similar questions
    results = client.query(
        collection_name=QD_COLLECTION_NAME,
        query_text=query,
        limit=5
    )
    print("Query:", query)
    for i, result in enumerate(results):
        print()
        print(f"{i+1}) {result.document}")

search("what is the best early morning meal?")
search("How should one introduce themselves?")
search("Why is the Earth a sphere?")
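
If you want to reuse the results instead of only printing them, a variant that returns each hit together with its similarity score can be handy. This is a minimal sketch, and search_with_scores is a name I made up; it relies on each result exposing document and score attributes, as the objects returned by client.query do.

```python
def search_with_scores(client, collection_name, query, limit=5):
    """Return (score, document) pairs for the most similar documents."""
    results = client.query(
        collection_name=collection_name,
        query_text=query,
        limit=limit,
    )
    return [(result.score, result.document) for result in results]


# Example usage with the client and collection from this guide:
# for score, text in search_with_scores(client, QD_COLLECTION_NAME, "Why is the Earth a sphere?"):
#     print(f"{score:.3f}  {text}")
```

Returning the pairs keeps printing separate from searching, so the same function can feed a web page or an evaluation script.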
