Building a Search Engine App with Redis & Saturn Cloud

Nov 3, 2022

The second hackathon of the mlops.community Engineering Lab (Hackathon) was held in collaboration with Redis, NVIDIA Inception, and Saturn Cloud. The main focus of the hackathon was Vector Search, and the dataset used for the project was the arXiv scholarly papers dataset.

Download the Dataset from Kaggle

The first step was to download the arXiv dataset from Kaggle into a Saturn Cloud notebook. This requires a Kaggle API token, which can be created from the Account tab of your Kaggle profile.
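As a rough sketch of this step, the snippet below writes the API token to a directory the Kaggle command-line tool can read and builds the download command. The helper names are hypothetical, the credentials are placeholders, and the dataset slug Cornell-University/arxiv is assumed to be the arXiv dataset used in the hackathon.

```python
import json
import os
from pathlib import Path
from typing import List

# Assumed Kaggle slug for the arXiv metadata dataset.
ARXIV_DATASET = "Cornell-University/arxiv"


def write_kaggle_token(username: str, key: str, config_dir: Path) -> Path:
    """Hypothetical helper: store the API token as kaggle.json in config_dir."""
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # Restrict access to the owner; the Kaggle CLI warns about readable tokens.
    os.chmod(token_path, 0o600)
    return token_path


def download_command(dataset: str, target_dir: str) -> List[str]:
    """Build the Kaggle CLI call that downloads and unzips a dataset."""
    return ["kaggle", "datasets", "download", "-d", dataset, "--unzip", "-p", target_dir]


# Example usage (not executed here; requires the kaggle package and network access):
#   write_kaggle_token("<username>", "<api-key>", Path.home() / ".kaggle")
#   subprocess.run(download_command(ARXIV_DATASET, "data"), check=True)
```

By default the Kaggle CLI looks for the token in ~/.kaggle/kaggle.json, which is why the usage comment points there.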

Clean the Dataset

Once the dataset was downloaded, we defined the variables DATA_PATH, YEAR_CUTOFF, YEAR_PATTERN, and ML_CATEGORY to clean the data and reduce its size. Working with a smaller dataset made the subsequent steps faster.
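A minimal sketch of how such variables could drive the cleaning step is shown below. The values assigned to the four variables, the helper names, and the exact filtering rules are assumptions made for illustration; only the variable names come from the project. The field names (id, title, abstract, categories, journal-ref) follow the arXiv metadata file, which stores one JSON object per line.

```python
import json
import re
from typing import Dict, Iterable, Iterator, Optional

DATA_PATH = "arxiv-metadata-oai-snapshot.json"  # assumed file name
YEAR_CUTOFF = 2012                              # assumed: keep this year and later
YEAR_PATTERN = r"\b(?:19|20)\d{2}\b"            # assumed: any four-digit year
ML_CATEGORY = "cs.LG"                           # assumed: machine learning category


def extract_year(journal_ref: Optional[str]) -> Optional[int]:
    """Return the earliest year mentioned in a journal reference, if any."""
    if not journal_ref:
        return None
    years = [int(y) for y in re.findall(YEAR_PATTERN, journal_ref)]
    return min(years) if years else None


def filter_papers(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield slimmed-down records for recent papers in the chosen category."""
    for line in lines:
        raw = json.loads(line)
        year = extract_year(raw.get("journal-ref"))
        if year is None or year < YEAR_CUTOFF:
            continue
        # The categories field is a space-separated string of arXiv codes.
        if ML_CATEGORY not in (raw.get("categories") or "").split():
            continue
        yield {
            "id": raw["id"],
            "title": raw["title"],
            "abstract": raw["abstract"],
            "year": year,
            "categories": raw["categories"],
        }


# Example usage (not executed here):
#   with open(DATA_PATH) as f:
#       papers = list(filter_papers(f))
```

Reading the year from the journal reference is a heuristic: page ranges that look like years can produce a wrong match, so a stricter pattern may be needed in practice.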

Create Embeddings

We used a Sentence Transformer model (mpnet-base-v1) to generate embeddings (vectors) for the cleaned dataset. Sentence Transformer models are deep learning models that encode text into fixed-length vectors, known as embeddings. MPNet-based Sentence Transformer checkpoints are typically fine-tuned on large collections of sentence pairs so that texts with similar meaning map to nearby vectors, which suits tasks such as semantic search and semantic textual similarity.
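The embedding step could look roughly like the sketch below, which uses the sentence-transformers library. The helper names are hypothetical, and the model identifier is left as a parameter because only the short name mpnet-base-v1 is given here; the full Hugging Face Hub identifier is not confirmed.

```python
from typing import Dict, List


def clean_abstract(text: str) -> str:
    """Collapse the line breaks and repeated spaces found in arXiv abstracts."""
    return " ".join(text.split())


def embed_papers(papers: List[Dict], model_name: str):
    """Encode each paper's abstract into one fixed-length vector."""
    # Imported lazily so the text helper above works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    texts = [clean_abstract(p["abstract"]) for p in papers]
    # Returns a NumPy array with one row per paper; rows are scaled to unit length.
    return model.encode(texts, normalize_embeddings=True)


# Example usage (not executed here; downloads the model on first use).
# The identifier below is a guess at what "mpnet-base-v1" abbreviates:
#   vectors = embed_papers(papers, "sentence-transformers/all-mpnet-base-v1")
```

Normalizing the embeddings to unit length is a design choice, not something the project states; it makes cosine similarity and inner product give the same ranking.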

By generating embeddings for the cleaned dataset, we were able to represent the paper abstracts as numerical vectors, which can be compared using mathematical operations. This makes it possible to perform tasks such as finding similar papers based on the abstract content.
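To make "compared using mathematical operations" concrete, here is a small illustration of cosine similarity, one common way to compare two embeddings. The toy two-dimensional vectors are made up for the example; real embeddings from this kind of model have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for abstract embeddings.
paper_a = [1.0, 0.0]
paper_b = [2.0, 0.0]   # same direction as paper_a
paper_c = [0.0, 3.0]   # orthogonal to paper_a

similar = cosine_similarity(paper_a, paper_b)
unrelated = cosine_similarity(paper_a, paper_c)
```

Here `similar` is higher than `unrelated`, which is the property a search over abstracts relies on.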

Loading Vectors into Redis

Redis recently introduced Vector Similarity Search as part of its search and query capabilities (the RediSearch module), which can index and query data stored as Redis hashes or JSON documents and supports full-text search alongside vector fields.

In the vector loading step, the load_vectors() function was used to create a hash key for each paper and map the input paper dictionaries to Redis hashes. Each hash stores the paper's embedding (vector) alongside its other fields, making it available for indexing and searching.
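The project's load_vectors() implementation is not shown here, so the following is only a sketch of what such a function might do. The signature, the key prefix, and the field names are assumptions. It relies on the hset(key, mapping=...) call from redis-py and stores each vector as raw float32 bytes, the layout a FLOAT32 vector field reads.

```python
import struct
from typing import Dict, Iterable, Sequence


def to_float32_bytes(vector: Sequence[float]) -> bytes:
    """Pack a vector as little-endian 32-bit floats."""
    return struct.pack(f"<{len(vector)}f", *vector)


def load_vectors(client, papers: Iterable[Dict], vectors: Iterable[Sequence[float]],
                 prefix: str = "paper:") -> int:
    """Write one Redis hash per paper and return how many were written.

    `client` is expected to behave like redis.Redis (assumed, not verified here).
    """
    count = 0
    for paper, vector in zip(papers, vectors):
        key = f"{prefix}{paper['id']}"  # hypothetical key scheme
        client.hset(key, mapping={
            "title": paper["title"],
            "abstract": paper["abstract"],
            "year": paper["year"],
            "vector": to_float32_bytes(vector),
        })
        count += 1
    return count


# Example usage (not executed here; needs a running Redis server):
#   import redis
#   load_vectors(redis.Redis(host="localhost", port=6379), papers, vectors)
```

For large datasets, batching the writes through a pipeline would reduce round trips; the loop above favors clarity over speed.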

Redis supports two indexing methods for vector similarity search: FLAT, which performs an exact brute-force scan, and HNSW, a graph-based index that returns approximate results faster on large datasets. In this project, we used the HNSW index to find the top K approximate nearest neighbors of a given vector. Redis also supports three common vector distance metrics for querying vectors: cosine, inner product, and Euclidean distance.
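As an illustration of how the index type and distance metric are chosen, the sketch below builds the arguments of an FT.CREATE command for a hash-based index with one vector field. The index name, key prefix, field names, and the dimension of 768 are assumptions for the example; the helper itself is hypothetical.

```python
from typing import List

ALGORITHMS = {"FLAT", "HNSW"}
METRICS = {"COSINE", "IP", "L2"}  # cosine, inner product, Euclidean


def build_create_index_args(index_name: str, prefix: str, dim: int,
                            algorithm: str = "HNSW", metric: str = "COSINE") -> List[str]:
    """Return the FT.CREATE arguments for a text field plus a vector field."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    if metric not in METRICS:
        raise ValueError(f"unsupported metric: {metric}")
    vector_attrs = ["TYPE", "FLOAT32", "DIM", str(dim), "DISTANCE_METRIC", metric]
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", prefix,
        "SCHEMA",
        "title", "TEXT",
        # The number after the algorithm is the count of attribute tokens that follow.
        "vector", "VECTOR", algorithm, str(len(vector_attrs)), *vector_attrs,
    ]


args = build_create_index_args("papers", "paper:", 768)

# Example usage (not executed here; needs a Redis server with search enabled):
#   client.execute_command(*args)
```

The DIM value must match the length of the stored embeddings, and switching `algorithm` to "FLAT" trades speed for exact results.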

Querying using K-nearest neighbor from Redis

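The final step was to query the nearest neighbors using K-nearest neighbor (KNN) search. In vector search, KNN means retrieving the K stored vectors that are closest to a query vector under the chosen distance metric. This is the same neighbor lookup that underlies the KNN classifier, but here the neighbors themselves are the result.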
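The idea can be shown with a brute-force version in plain Python, which compares the query against every stored vector and keeps the K closest. This is an illustration of exact KNN using Euclidean distance on made-up data, not the code used in the project.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def knn(query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k keys whose vectors are closest to the query, nearest first."""
    scored = ((math.dist(query, vec), key) for key, vec in vectors.items())
    return [(key, dist) for dist, key in heapq.nsmallest(k, scored)]


# Toy data standing in for paper embeddings.
stored = {
    "paper:a": [0.0, 0.0],
    "paper:b": [1.0, 1.0],
    "paper:c": [5.0, 5.0],
}
neighbors = knn([0.9, 0.9], stored, k=2)
nearest_keys = [key for key, _ in neighbors]
```

The cost of this scan grows linearly with the number of stored vectors, which is the motivation for approximate indexes on large datasets.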

In this project, KNN search was used to find the K papers most similar to a given paper, based on the similarity of their embeddings. Because Redis runs the search against the vector index, queries remain fast as the dataset grows.
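A KNN query against a Redis vector index can be expressed as an FT.SEARCH command. The sketch below only builds the command arguments; the index name, field names, and the helper are assumptions for illustration, and the query vector would normally be the embedding of the paper or search text.

```python
import struct
from typing import List, Sequence, Union


def build_knn_search_args(index_name: str, query_vector: Sequence[float], k: int,
                          vector_field: str = "vector",
                          return_fields: Sequence[str] = ("title",)) -> List[Union[str, bytes]]:
    """Return FT.SEARCH arguments for a top-k vector query."""
    # The query vector is passed as raw float32 bytes through a query parameter.
    blob = struct.pack(f"<{len(query_vector)}f", *query_vector)
    query = f"*=>[KNN {k} @{vector_field} $vec AS vector_score]"
    fields = [*return_fields, "vector_score"]
    return [
        "FT.SEARCH", index_name, query,
        "RETURN", str(len(fields)), *fields,
        "SORTBY", "vector_score",
        "LIMIT", "0", str(k),
        "PARAMS", "2", "vec", blob,
        "DIALECT", "2",  # vector queries require query dialect 2 or later
    ]


search_args = build_knn_search_args("papers", [1.0, 0.0], k=5)

# Example usage (not executed here; needs a Redis server with the index created):
#   results = client.execute_command(*search_args)
```

With a distance-based score, sorting in ascending order returns the closest papers first.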

In conclusion, Redis and Saturn Cloud made it possible to build an efficient and scalable vector search application. Sentence Transformer embeddings capture the meaning of each abstract, and the Redis Vector Similarity Search feature indexes and searches those embeddings, so semantically similar papers can be found with a single query.