Vector database, a gentle introduction (+ hands-on)

Giacomo Villa
Data Reply IT | DataTech
12 min read · Dec 18, 2023

So, I belong to that generation of computer science students who, during their university years, attended multiple courses about databases, DBMSs, archiving infrastructure and so on.

Most of them focused on the SQL approach: the usefulness of ACID guarantees, rigid structure, rollback logic and strategies, logging, data indexing, and the pros and cons of each database engine. Very useful, very educational material that every engineer should go through. However, I remember that my favourite section of these courses was usually the last one, when the words “NoSQL” appeared.

I think this amazement of mine arose because, since high school, I had been used to a relational world with fixed patterns. When you had to design a data collection system, you started by analysing the reality under study: who the players were, how they interacted with each other, what information was useful to store, how you could uniquely identify each entity, and what kind of relationships existed between them. And already here a first weakness of the SQL world comes to the surface: rigidity. It presumes that a fixed schema can be found for reality, which is by its very nature changeable. Maybe not today, maybe not in the next few months, but sooner or later things change and you have to adapt accordingly, potentially with a lot of work.

Born in the early 2000s, NoSQL databases have since grown and expanded into all market sectors. One of the main reasons these databases have seen such wide adoption is the nature of the data itself, with structured data increasingly giving way to unstructured data. It is currently estimated that by 2025, 80% of all data will be unstructured. Its complexity, size and velocity will be among the biggest challenges that a new generation of Data Engineers and Data Scientists will have to deal with. Unstructured data, like images, videos and documents, can be a problem when you need a storage and retrieval system based on similarity metrics that takes into account aspects such as total size, query speed and overall recall. In this context, vector databases can be a valid ally after careful data pre-processing.

A vector database is a new breed of NoSQL database that stores data as high-dimensional vectors, also called embeddings, allowing efficient similarity search, nearest-neighbour queries, and other vector-specific operations. Leveraging freely available embedding models, it is possible to translate each piece of data into a vector and benefit from this approach.

In this article, we will go through an introduction to and motivation for vector databases, and then move on to a practical example.

So, why wait? Let’s go!

What are vector databases?

Vector databases are, essentially, collections of vectors representing data of interest. These databases index and store embeddings (our vectors) so that they can be retrieved quickly, with search methods based, among other criteria, on similarity to a given input or query. Embeddings, i.e. vectors representing a certain piece of data, are typically generated by AI models: we are, for example, transforming a photo into a vector whose size depends on the model used.

Basically, the operation of a vector database in its simplest form is as follows:

Given a piece of unstructured data (whether an image, a video or text), an embedding model transforms it into a vector. The vector can have a different size depending on the model used; thus, the vector space in which it is placed can have different dimensions. In addition to the vector representation of the unstructured data, it is usually possible to include further metadata such as the path to the file, textual information about the data, categories, and so on.

The query procedure follows a very similar reasoning:

The input is passed through the same embedding model used to insert the data, and we then query the vector database, setting parameters that let us say ‘how far’ we are willing to go to retrieve results. Simple, no?
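The two flows just described can be sketched in a few lines of plain Python. The `embed` function below is a hypothetical stand-in for a real embedding model (it just derives a deterministic pseudo-random vector from the input), and the “database” is a plain list; a real vector database replaces the linear scan with an index.

```python
import numpy as np

# Hypothetical stand-in for a real embedding model: any function
# mapping raw data to a fixed-size vector works for this sketch.
def embed(text: str, dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(dim)

# "Insert": store each vector together with its metadata.
store = []
for doc in ["dog playing", "cat sleeping", "bird singing"]:
    store.append({"vector": embed(doc), "metadata": {"text": doc}})

# "Query": embed the input with the SAME model, then rank the stored
# vectors by distance to the query vector and keep the closest ones.
query_vec = embed("dog playing")
ranked = sorted(store, key=lambda e: np.linalg.norm(e["vector"] - query_vec))
print(ranked[0]["metadata"]["text"])
```

Since the query text matches a stored document, its vector is identical and comes back first; with a real embedding model, semantically similar items end up close together even when they are not identical.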

Why vector databases?

But why would anyone choose to use vector databases? In general terms, we can say that there are benefits in the following areas:

  • Similarity Search Capabilities: Vector databases are adept at conducting similarity searches, identifying the best match between a user’s query and a specific vector embedding. They can seamlessly store and manage vast collections of embeddings, numbering in the billions.
  • Scalability Mastery: Vector databases excel in managing substantial datasets. Their adeptness at storing and searching billions of high-dimensional vectors positions them as a top choice for large-scale machine-learning applications.
  • Search Performance: Vector databases are renowned for their exceptional high-speed search capabilities. Utilizing advanced indexing techniques, they guarantee swift retrieval of similar vectors, even within expansive databases.
  • Efficiency: Using dimensionality reduction techniques, they expertly compress high-dimensional vectors into lower-dimensional spaces without compromising crucial information. This dual capability underscores their efficiency, making them highly effective in both storage and computation tasks.
  • Versatility: Vector databases are characterized by their adaptability in supporting diverse data models. With the capacity to handle both structured and unstructured data, they emerge as a fitting choice for a broad spectrum of applications, including text and image searches, as well as recommendation systems.
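The dimensionality-reduction point under “Efficiency” can be illustrated with a generic random projection (a Johnson-Lindenstrauss-style sketch, not the internal mechanism of any specific database): high-dimensional vectors are mapped into a much smaller space while pairwise distances stay roughly comparable.

```python
import numpy as np

rng = np.random.default_rng(1)
high_dim, low_dim, n = 512, 64, 200

vectors = rng.random((n, high_dim))

# Random projection: a Gaussian matrix scaled by 1/sqrt(low_dim)
# approximately preserves pairwise Euclidean distances.
projection = rng.standard_normal((high_dim, low_dim)) / np.sqrt(low_dim)
reduced = vectors @ projection

d_before = np.linalg.norm(vectors[0] - vectors[1])
d_after = np.linalg.norm(reduced[0] - reduced[1])
print(reduced.shape, round(d_before, 2), round(d_after, 2))
```

Storing and comparing 64 floats instead of 512 cuts both memory and per-query computation by roughly the same factor, at the cost of a small, controlled distortion.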

Milvus, an open source solution

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible and provides a consistent user experience regardless of the deployment environment.

Milvus is a possible solution when there is a need for a vector database. Getting started is extremely simple and straightforward, as we will see in the hands-on. But let us proceed in order; here is a simple representation of the architecture, taken from the official documentation:

(Architecture diagram, from the official Milvus documentation)

As we can see, it is very similar to what we sketched initially; the diagram makes explicit that the unvectorised data must live in a separate storage system, not within Milvus itself.

Some fundamental concepts are worth stating before we proceed.

Collection

In Milvus, a “collection” is a fundamental organizational unit used to manage and store vectors. It serves as a logical container for grouping related vectors, providing a structured and efficient way to organize and query large-scale vector data. The concept of collections in Milvus is integral to its functionality.

Partition

As a Milvus collection grows, query performance may decline. To address this, Milvus employs partitioning, dividing the data into segments on physical storage based on specific rules. This optimizes query efficiency, particularly when dealing with relevant subsets of the data. Each partition becomes a logical unit, contributing to both organization and scalability in Milvus.

Segment

A segment in Milvus represents a physical storage unit for vectors within a collection. It’s a way to manage the underlying storage of data efficiently. Each segment can be thought of as a subset of vectors within a collection. When a segment becomes full or reaches a certain size, Milvus may create a new segment to accommodate additional vectors.

(Segments illustration, from the official Milvus documentation)

Indexing

The majority of vector index types supported by Milvus use approximate nearest neighbour search (ANNS) algorithms. Based on their implementation methods, ANNS vector indexes can be classified into four distinct categories:

  • Tree-based index
  • Graph-based index
  • Hash-based index
  • Quantization-based index

Milvus provides several indexes; the one we are going to use is called IVF_FLAT.

Similarity Metrics

Milvus utilizes similarity metrics to measure vector similarities, and selecting an optimal distance metric is crucial for significantly enhancing classification and clustering performance. The metrics in Milvus are as follows:

  • Euclidean distance (L2)
  • Inner product (IP)
  • Cosine Similarity
  • Jaccard distance
  • Hamming distance

Again, the official documentation is comprehensive and provides all the necessary details to guide the choice of the best metric given the use case.
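For the first three metrics, which apply to float vectors (Jaccard and Hamming are instead meant for binary vectors), here is a quick NumPy refresher of what is actually being computed:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])

# Euclidean distance (L2): smaller means more similar.
l2 = np.linalg.norm(a - b)

# Inner product (IP): larger means more similar; sensitive to magnitude.
ip = np.dot(a, b)

# Cosine similarity: compares direction only, in [-1, 1];
# larger means more similar.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(l2, ip, cosine)
```

Note the direction of each score: L2 is a distance (lower is better), while IP and cosine are similarities (higher is better), which matters when interpreting search results.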

Hello world with Milvus

Let’s get our hands dirty. In order to follow this section, first verify the correct installation of the following software: Anaconda and Docker (with Docker Compose).

We can start :)

Create a folder called “Milvus” and move into it with the Anaconda terminal.

mkdir Milvus
cd Milvus

After that, we can create a test environment with Anaconda. This step can also be considered optional, but it is always a good practice to create new independent environments whenever working on new projects. Then in your already opened Anaconda prompt, type:

conda create -n milvus-hello-world python=3.10

When asked if you want to proceed, type “y” and press enter. Once the environment is created, we can activate it:

conda activate milvus-hello-world

Now install pymilvus, the Milvus Python SDK:

pip install pymilvus==2.3.3

In order to use this environment in Jupyter notebooks, it is necessary to execute a couple of commands:

conda install jupyter
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=milvus-hello-world

We can now launch the jupyter notebook:

jupyter notebook

and it should be possible to see the following environment:

Click on “milvus-hello-world” to create a notebook that we will use later.

Now, by clicking on this link, you can download the official Milvus docker-compose file to install the vector database in standalone mode.

Once it has been downloaded, put it in the “Milvus” folder and from the terminal, type:

docker compose up -d

Once you get your terminal back, type:

docker ps

You should be able to see the following output:

Excellent! With Milvus installed, we can move on to the code!

In the notebook created earlier, let’s start by putting down some code. For this demo we will do without an embedding model; we will be the model ourselves :)

First a little import:

import random

import numpy as np
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

We can then connect to Milvus which, if you have followed this guide correctly, should be running in Docker and reachable on port 19530:

connections.connect("default", host="localhost", port="19530")

Now suppose we have embeddings of images of different animals (dog, cat, bird, etc.); in addition to the image vector, we also store the race of each animal. We must then define the structure of our collection and create it. With Milvus it is very simple:

num_entities, dim = 3000, 8

fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="animals_race", dtype=DataType.VARCHAR, description="race of animal", max_length=100),
    FieldSchema(name="image_embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim, description="image plot embedding")
]
schema = CollectionSchema(fields, "My image collection")

if utility.has_collection("hello_milvus"):
    utility.drop_collection("hello_milvus")

hello_milvus = Collection("hello_milvus", schema, consistency_level="Strong")

Now we are going to generate random values to simulate embeddings; usually this step is performed by a model (the Towhee library offers many).

animal = ["Dog", "Cat", "Bird", "Lion", "Parrot"]
rng = np.random.default_rng(seed=19530)
entities = [
    [str(i) for i in range(num_entities)],                 # pk
    [random.choice(animal) for _ in range(num_entities)],  # animals_race
    rng.random((num_entities, dim))                        # image_embeddings
]

We can then insert what we generated into our collection.

insert_result = hello_milvus.insert(entities)
hello_milvus.flush()
print(f"Number of entities in Milvus: {hello_milvus.num_entities}") #3000

Done, simple right? Now we can start running some queries on the data we have just inserted; once again we will play the embedding model ourselves. First, however, it is necessary to create an index on the field that represents our embeddings.


index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}

hello_milvus.create_index("image_embeddings", index)

We have therefore defined an index of the “IVF_FLAT” type, and we will use the “L2” metric (Euclidean distance) as the score indicating how similar one vector is to another. In particular, IVF_FLAT is a quantization-based index that finds a sweet spot between accuracy and query speed, making it suitable for many scenarios.

Prior to any search or query operation, it is necessary to load the data residing in the hello_milvus collection into memory for efficient processing and access:

hello_milvus.load()

Now, suppose we are looking for a photo we previously inserted: we take one of the inserted vectors and use it as the query vector to submit to the database.

vectors_to_search = entities[-1][-1:]

search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10},
}

result = hello_milvus.search(data=vectors_to_search,
                             anns_field="image_embeddings",
                             param=search_params,
                             limit=3,
                             output_fields=["animals_race"])

for hits in result:
    for hit in hits:
        print(f"ID: {hit.id} - DISTANCE: {hit.distance} - ENTITY INFO: {hit.entity}")

As you can see from the result, several fields are returned. The most important is certainly the distance score, which lets us return the most similar vectors given the metric set previously, and thus provides a form of sorting. In particular, the “nprobe” parameter indicates how thorough we want the search to be: it is the number of clusters to compare against our query vector. We can look at it this way: the lower the nprobe, the faster but potentially less complete the response; conversely, the higher the nprobe, the more complete the result, but the longer the query takes.
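To build intuition for how an IVF index and nprobe interact, here is a toy, NumPy-only sketch (not Milvus’s actual implementation): at build time the vectors are grouped into nlist clusters, and at query time only the nprobe clusters closest to the query are scanned.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, nlist = 8, 1000, 16

data = rng.random((n, dim))

# Build step: partition the vectors into nlist clusters (a real index
# runs k-means; a few crude refinement passes suffice for the sketch).
centroids = data[rng.choice(n, size=nlist, replace=False)].copy()
for _ in range(5):
    assign = np.argmin(((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

def ivf_search(query, nprobe):
    # Search step: visit only the nprobe clusters whose centroids are
    # closest to the query, then scan their members exhaustively.
    nearest_clusters = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    candidates = np.where(np.isin(assign, nearest_clusters))[0]
    dists = ((data[candidates] - query) ** 2).sum(-1)
    return candidates[np.argmin(dists)]

query = data[42]  # a vector we know is stored
found = ivf_search(query, nprobe=nlist)  # probing every cluster = exact search
print(found)
```

With nprobe equal to nlist every vector is scanned and the true nearest neighbour is guaranteed to be found; with a small nprobe the search is much cheaper but may miss it if it sits in an unvisited cluster, which is exactly the speed/recall trade-off described above.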

Milvus, in addition to querying by vector metrics, also allows “relational” predicates to be set in the query. We can therefore request that the input vector only be compared with vectors whose value for a certain field satisfies a condition.

vectors_to_search = entities[-1][-1:]

search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10},
}

result = hello_milvus.search(data=vectors_to_search,
                             anns_field="image_embeddings",
                             param=search_params,
                             limit=3,
                             expr="animals_race == 'Dog'",
                             output_fields=["animals_race"])

for hits in result:
    for hit in hits:
        print(f"ID: {hit.id} - DISTANCE: {hit.distance} - ENTITY INFO: {hit.entity}")

In this way, we limit the possible comparisons that will be made in the query phase. It works great, doesn’t it?

A final feature worth presenting is Attu, an efficient open-source management tool designed specifically for Milvus. Its intuitive graphical user interface streamlines interaction with the database: with a few clicks you can visualize the current status of your clusters, manage metadata, execute data queries, and explore your Milvus databases with ease.

Then, following the official guide you can find here, we install Attu. First, retrieve the IP address of your machine: on Windows, use the ipconfig command from your terminal (on Linux/macOS, ifconfig or ip addr).

We can now run the following command, using the IP we just obtained (yours will differ):

docker run -p 8000:3000 -e MILVUS_URL=192.168.1.16:19530 zilliz/attu:v2.3.1

Once you are notified on screen that the Attu server has started, you can access the GUI at http://localhost:8000, where you should see the following page:

Now, if, like me, you have not set any user or password on the Milvus server, clicking on “connect” will take you into the interface.

From here you can select a collection, preview its contents and data, navigate any partitions, perform queries, check segments, plus a whole series of very useful operations and views that will allow you to navigate your collections in comfort.

Conclusion

In summary, vector databases prove their worth through their proficiency in executing similarity searches, scalability, high-speed performance, efficient management of high-dimensional data, and versatility in accommodating different data models. In addition, their ease of use makes them compact and user-friendly systems.

These attributes make them invaluable across diverse fields and applications.

Now you just have to challenge yourself, perhaps by creating an image search engine or a film recommendation system. You can take inspiration from the bootcamp made available by Milvus, which, as we have experienced, is an excellent open-source tool for getting to grips with the potential of this type of non-relational database.

Besides Milvus, there are other noteworthy vector databases, each with its own peculiarities. The main ones are Chroma, Pinecone, Weaviate, MongoDB and others. A more complete analysis of the benefits can be found in this article published here on Medium.

In any case, the advent of unstructured data and the progress of ever more advanced AI models will drive increasing use of non-relational databases, vector databases among them.
