Being a Speaker as an Intern: “Soft Introduction to Vector Database”

Maria Khelli
Published in tiket.com
Jun 4, 2023 · 9 min read

A few weeks ago, I was in a biweekly meeting with one of the DS Leads. During that meeting, she suddenly asked me to be a speaker at one of our upcoming internal sharing sessions. Usually, the speakers are full-time employees, leads, or even managers. Sometimes interns also get the chance at the end of their contract. Hence, I really was not expecting it, since my contract was not ending anytime soon (at that time). But, oh well, I am still blessed to have such a great opportunity.

When I wrote this article, I was working for PT. Global Tiket Network, known as tiket.com in Indonesia. Here, we have sharing sessions conducted regularly as a part of our culture. Every sharing session has a different presenter and a different topic. The topic is up to the speaker, but I was not sure what I should talk about since our team is filled with Masters and PhDs. I thought most of my knowledge would already be covered by them, because I have not even completed my undergraduate degree.

Therefore, I asked my mentor for advice. He told me to talk about vector databases since they are a relatively new tool for us, so other members might not know about them. Still, I was in doubt since the topic was also new to me, but I was willing to learn about it. To convince me, my mentor said, “It does not have to be detailed and expert-level. Just present it as FYI knowledge. It is to let the DS fellows know that we do have this kind of technology. With that, we can expand our possible solutions.”

… and so I took the chance.

The first few days of researching were overwhelming for me since I did not know what to read first. But several days later, I started to get a bit of a grasp on it. So, from this point forward, I will give a brief summary of vector databases based on my little research. Disclaimer: I am also very new to this topic. Thus, I may miss some details and some information may be inaccurate. Feel free to correct me! :)

What is a Vector Database?

A vector database basically has a similar function to a traditional database, but imagine it this way: it has a special column that can store vector embeddings and use them for fast retrieval and similarity search. With that in mind, vector databases are also built with the ability to create, read, update, and delete (CRUD) records, as in traditional databases.

In addition to what I have mentioned above, vector databases support metadata filtering. Thus, when we query with vector embeddings, we can filter the results based on metadata.

For example, let’s say that we have a database that stores a lot of apparel. When we want to search for items similar to the embedding [1, 0.2, 0.8, …, 0.6], we can also specify the price or the color. So, in natural language, we would command the vector database to

“Hey, please search for whatever items that are priced under $100 and colored red, but similar (or near) this embedding [1, 0.2, 0.8, …, 0.6].”

There are 3 methods for this metadata filtering: pre-filtering, in-filtering, and post-filtering.
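To make pre-filtering concrete, here is a toy sketch in plain Python with NumPy. The apparel items, prices, and colors are made up for illustration, and real vector databases implement this far more efficiently with indexes:

```python
import numpy as np

# Toy "collection": each item has an embedding plus metadata (hypothetical data).
embeddings = np.array([
    [1.0, 0.2, 0.8, 0.6],
    [0.9, 0.1, 0.7, 0.5],
    [0.1, 0.9, 0.2, 0.3],
])
metadata = [
    {"id": 1, "price": 80, "color": "red"},
    {"id": 2, "price": 120, "color": "red"},
    {"id": 3, "price": 60, "color": "blue"},
]

query = np.array([1.0, 0.2, 0.8, 0.6])

# Pre-filtering: apply the metadata filter first, then search only the survivors.
candidates = [
    i for i, m in enumerate(metadata)
    if m["price"] < 100 and m["color"] == "red"
]

# Brute-force nearest neighbor (Euclidean distance) over the filtered candidates.
distances = [np.linalg.norm(embeddings[i] - query) for i in candidates]
best = candidates[int(np.argmin(distances))]
print(metadata[best])  # -> {'id': 1, 'price': 80, 'color': 'red'}
```

Post-filtering does the opposite (search first, then drop results that fail the filter), while in-filtering applies the filter during the search itself.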

Scalability-wise, they have horizontal scaling functionality. As is common with traditional databases, some of them implement sharding, and some others are built on top of Kubernetes! So scalability should not be a big concern.

Why a Vector Database?

As you may have seen in the previous section, vector databases are powerful when it comes to using embeddings for similarity search. Now, the question is: aren’t they the same as what Elasticsearch already offers, like TF-IDF or BM25?

The answer is no, they are not the same. Hence, we need to differentiate between keyword search and vector search. Suppose we want to search for an image with a text query. With keyword search, we might get:

  • Query: Dog cuddling with cat on the grass
  • Result: (image)

I mean, it is not totally wrong, they still look cute anyway! But the result is only acceptable, not correct. Now compare it with vector search:

  • Query: Dog cuddling with cat on the grass
  • Query embedding: [0.07289, -0.227076, 0.20138, …]
  • Image embedding: [0.08213, -0.271234, 0.01812, …]
  • Result: (image)

See the difference? :)

Keyword search might rely only on the description or the title of the image. When the description and the title are not representative of the image, we cannot find it using keyword search.

Other than that, keyword search has several disadvantages. Ask yourself these questions:

  1. Algorithms like TF-IDF only work on text. What if your data are images, videos, or audio?
  2. TF-IDF is also known to struggle with word order. How do you distinguish “Trip from Japan to Rome” from “Trip from Rome to Japan”? (See the quick check after this list.)
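To see point 2 for yourself, here is a quick check. I am using scikit-learn purely for illustration (it is not something the sharing session used): because TF-IDF is a bag-of-words model, the two trips end up with identical vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Trip from Japan to Rome", "Trip from Rome to Japan"]

# TF-IDF only counts which words appear, not the order they appear in.
vectors = TfidfVectorizer().fit_transform(docs)

# Both sentences contain exactly the same words, so their vectors are identical.
print(cosine_similarity(vectors[0], vectors[1]))  # -> [[1.]]
```

A vector search over sentence embeddings, on the other hand, can keep the two queries apart, because a good encoder takes word order into account.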

These problems motivated us to use vector search (which is powered by vector databases). Beyond search, vector databases also shine at other distance-related tasks. For instance, they can:

  • Deduplicate: eliminate similar or near-duplicate records.
  • Detect anomalies, a.k.a. “reverse near”: find distant objects.

Now that you have an understanding of keyword search vs vector search, you might want to know the difference between traditional databases and vector databases. In a nutshell, a traditional DB returns exact matches, while a vector DB can return near neighbors.

More on Vector Databases

Another thing that you may find interesting is the vector search pyramid. I got this vector search pyramid from Dmitry Kan. It explains the stack that you can use with vector databases.

We read the pyramid from the lowest level up:

  • KNN / ANN algorithms: these empower vector databases to answer queries using certain algorithms. So, in natural language, we might say,

“Hey, please find K nearest neighbors for this [1, 0.2, 0.8, …, 0.6] embedding using XXX algorithm.”

Usually, vector databases use approximate nearest neighbors, since exact KNN is exhaustive and not efficient for most use cases. Several vector databases such as Milvus or Qdrant have HNSW (hierarchical navigable small world) as a built-in indexing algorithm.

  • Vector databases: other than having built-in indexing algorithms, they have other capabilities such as metadata filtering and horizontal scaling (as I mentioned before).
  • Neural frameworks: these help developers build end-to-end search systems. They are useful for integrating vector databases and encoders.
  • Encoders: generate embeddings for unstructured data, such as images, text, videos, or audio. These embeddings are used when saving data to a vector database and when a user queries something (the user’s input is translated into an embedding first, then that embedding is queried against the vector database). A small sketch combining an encoder with an ANN index follows after this list.
  • App and user interface: self-explanatory.
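To tie the encoder and KNN/ANN layers together, here is a small sketch of my own (not from the original pyramid): it uses sentence-transformers as the encoder and hnswlib as the ANN index, both of which are choices I am assuming for illustration.

```python
import hnswlib
from sentence_transformers import SentenceTransformer

# Encoder layer: turn unstructured text into embeddings (model choice is just an example).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Dog cuddling with cat on the grass",
    "A plate of fried rice with a fried egg",
    "Two puppies playing in a park",
]
embeddings = encoder.encode(corpus)

# KNN/ANN layer: build an HNSW index over the embeddings. Milvus and Qdrant ship HNSW
# built in; here we use the standalone hnswlib library to show the idea.
index = hnswlib.Index(space="cosine", dim=embeddings.shape[1])
index.init_index(max_elements=len(corpus), ef_construction=200, M=16)
index.add_items(embeddings, list(range(len(corpus))))
index.set_ef(50)

# Query: encode the user input first, then ask the index for the nearest neighbors.
query_embedding = encoder.encode(["a dog and a cat lying together outdoors"])
labels, distances = index.knn_query(query_embedding, k=2)

for label, distance in zip(labels[0], distances[0]):
    print(corpus[label], distance)
```

In a real system, the vector database replaces the in-memory index, and a neural framework can glue the encoder and the database together.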

Examples of Vector Databases

For the vector database examples, I relied heavily on this article. I have not yet explored and studied each database in detail. The author also shared his knowledge through a talk, so I think it is wiser for you to visit that page as well! :D

Vector Database Benchmarks

  1. Benchmark by Qdrant
  2. More generalized benchmark

Quick Demo / Tutorial

For the demo, I used Milvus as the vector database and Attu as its management tool. Both can run locally using Docker. The purpose of this section is to demonstrate how we can query a vector database using an index and a vector embedding.

Milvus

1. Download Milvus docker-compose.yml here.

2. Run the following command. Use `sudo` if you are using Linux.

docker compose up -d

It will start pulling the images and bringing the containers up. Just wait until the process finishes. Because I am using Windows, my screen looks like this when it is finished.

3. Check the status using the following command.

docker ps

It should look like this.

4. You are done with Milvus, let’s move on to Attu.

Attu

1. Running Attu is simple; just type

docker run -p 8000:3000 -e HOST_URL=localhost:8000 -e MILVUS_URL={YOUR_LOCAL_IP}:19530 zilliz/attu:latest

To get your local IP, you can check using ipconfig in the terminal (ipconfig is a Windows command; on Linux or macOS, use ip addr or ifconfig instead).

2. After some time, you will see a message in the terminal, and you can open Attu at http://localhost:8000/. You can close the terminal now. Let’s continue on the Attu dashboard.

3. On Attu, you will see something like this. Log in using the IPv4 address and port of Milvus. I did not set a username and password for Milvus, so I leave them empty.

4. Let’s try to make a simple book collection. On the dashboard, open the Collection page. Navigate using the left navbar.

5. Click on `Create Collection`. I will use the following schema. Feel free to modify it and explore on your own. Click Create when you are finished.

6. After that, we will create the index. Click on the Book collection. On the `embedding` column, click `CREATE INDEX`. We will use IVF_FLAT for this one.

7. Then, we want to import data for the books, so we will prepare a CSV. The CSV should contain embedding, title, and is_available (all columns other than id, since our id is auto-generated). Here is my example.
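If you prefer to generate a toy CSV programmatically instead of writing it by hand, here is a sketch with pandas and NumPy. The embedding dimension of 4 and the book titles are placeholders of mine, so match the dimension to whatever you set in the schema in step 5, and check how your Attu version expects the vector column to be formatted.

```python
import numpy as np
import pandas as pd

DIM = 4  # placeholder: must match the dim of the `embedding` field in your schema
titles = ["Book A", "Book B", "Book C"]  # made-up titles

rows = []
for title in titles:
    embedding = np.random.rand(DIM).round(4).tolist()
    rows.append({
        "embedding": str(embedding),  # vector written as a list-like string, e.g. [0.1, 0.2, ...]
        "title": title,
        "is_available": True,
    })

pd.DataFrame(rows).to_csv("books.csv", index=False)
```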

8. After that, we import the data using Attu.

Choose the CSV file you have created > Next > Import Data.

9. Hover over the Book collection; on the right-hand side you will see a `load` button. Click on that.

10. After it says `Loaded`, you can move to the `Vector Search` page using the navbar. Choose the Book collection and input a vector embedding. I will query using one of the embeddings in my data. Hence, the first (nearest) result is expected to have a distance of 0.

Yeay, you have successfully completed the demo! Feel free to explore beyond this! :)
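Everything we just did through the Attu UI can also be done from code. Below is a minimal sketch using pymilvus, the Python client for Milvus. The field names mirror the schema above, but the embedding dimension, the toy data, and the index parameters are assumptions of mine, so adjust them to your setup.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

# Connect to the local Milvus started by docker compose.
connections.connect("default", host="localhost", port="19530")

# Same fields as the Attu demo: auto-generated id, embedding, title, is_available.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=4),  # placeholder dim
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="is_available", dtype=DataType.BOOL),
]
collection = Collection("Book", CollectionSchema(fields))

# IVF_FLAT index on the embedding field, as in step 6.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)

# Insert a few toy rows (column-oriented: one list per non-auto field).
collection.insert([
    [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6]],  # embedding
    ["Book A", "Book B"],                           # title
    [True, False],                                  # is_available
])
collection.load()

# Vector search: the first hit for an embedding that is already in the data
# should come back with a distance of 0, just like in the Attu demo.
results = collection.search(
    data=[[0.1, 0.2, 0.3, 0.4]],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=2,
    output_fields=["title", "is_available"],
)
for hit in results[0]:
    print(hit.distance, hit.entity.get("title"))
```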

Acknowledgement

  • Thank you Ci Elisafina Siswanto as the sharing session coordinator, for offering me this opportunity.
  • Thank you Pak Muhammad Adib Imtiyazi as my mentor, for suggesting and encouraging me.
  • Thank you Maz for helping and assisting me throughout the session.
  • Thank you Pak Setia Budi for motivating me to write this article.
  • Thank you DS & MLE team for actively asking and discussing in my session.


Maria Khelli · Software Engineer and AI Enthusiast | CS Undergraduate @ ITB