Unstructured Data Service

Metadata Management in the Milvus Vector Database (3)

How to Manage Data Files with Metadata

Milvus
Vector Database for AI


Now that we have covered the metadata of Milvus, we can see how that metadata is used. This article takes SQLite as the example.

Create a table

Use the Python API to create a table:

milvus.create_table({
'table_name': 'table_1',
'dimension': 512,
'index_file_size': 1000,
'metric_type': MetricType.L2
})

Milvus immediately adds a row for the 512-dimensional table to Tables, with index_file_size stored as 1048576000 bytes (1000*1024*1024) and metric_type as 1. TableFiles is still empty.
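The size conversion above can be checked with a few lines of Python. This is only an arithmetic sketch; the helper name mb_to_bytes is hypothetical and is not part of the Milvus API.

```python
def mb_to_bytes(size_mb: int) -> int:
    """Convert a size in MB (1 MB = 1024 * 1024 bytes) to bytes."""
    return size_mb * 1024 * 1024


# index_file_size=1000 passed to create_table corresponds to this stored value.
stored_size = mb_to_bytes(1000)
print(stored_size)  # 1048576000
```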

The new row is added by SQL:

INSERT INTO Tables VALUES(1, 'table_1', 0, 512, 1576306272821064, 2, 1048576000, 1, 16384, 1, , , '0.6.0');

Checking Tables with a SQLite client shows that engine_type and nlist use their default values (1 and 16384 in the row above).

Insert vectors

We can insert some vectors in the table:

milvus.insert(table_name='table_1', records=vec_list, ids=vec_ids)

Assume that we insert 10,000 vectors per batch until a total of 1 million 512-dimensional vectors has been inserted. If you check TableFiles during insertion, you can see that new entries are continuously generated to replace old entries. This is because a background thread keeps combining smaller files into larger ones and removes the smaller files in the process.

If you check TableFiles after all vectors are inserted, you will probably find that two files remain.

From the row_count field we can see that the first file has 530,000 vectors and the second file has 470,000 vectors. This is because, once the first file grows beyond 1048576000 bytes during combination, no more files are combined into it. The remaining vectors are combined into the second file, whose final size is 966320113 bytes, which is less than index_file_size. So, if more vectors are inserted, this file can still be combined with new files.
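To make the combination rule concrete, here is a minimal sketch of a greedy merge in Python. It is a simplified model, not the Milvus implementation: the function merge_files is hypothetical, batches are merged strictly in insertion order, and each vector is assumed to take only 512 * 4 bytes of raw float data. A real run can split differently (the text above reports 530,000 and 470,000 vectors), because the result depends on which files the background thread picks up on each pass and on the exact on-disk size per vector.

```python
def merge_files(file_rows, bytes_per_row, size_limit):
    """Greedily merge small files in order.

    A merged file stops accepting new files once its size exceeds size_limit.
    Returns a list of (row_count, size_in_bytes) tuples, one per merged file.
    """
    merged = []
    rows = 0
    for n in file_rows:
        rows += n
        if rows * bytes_per_row > size_limit:
            # The file is now larger than the limit, so it is closed.
            merged.append((rows, rows * bytes_per_row))
            rows = 0
    if rows:
        # The last file is still below the limit and could keep growing.
        merged.append((rows, rows * bytes_per_row))
    return merged


batches = [10_000] * 100          # 100 batches of 10,000 vectors
bytes_per_row = 512 * 4           # assumption: raw float32 data only
size_limit = 1000 * 1024 * 1024   # index_file_size in bytes

result = merge_files(batches, bytes_per_row, size_limit)
for row_count, size in result:
    print(row_count, size, size > size_limit)
```

Under these assumptions the sketch also ends with two files: one that has passed the limit and is closed, and one that is still below the limit.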

Milvus uses SQL to operate TableFiles:

INSERT INTO TableFiles VALUES(...);
DELETE FROM TableFiles WHERE ...;

You can also find the files in the Milvus data folder. Here, the data folder is /tmp/milvus. Each table has its own folder, so the two files above are located in /tmp/milvus/db/tables/table_1.

Query the number of vectors

The client uses count_table to query the number of vectors in a table:

milvus.count_table(table_name='table_1')

In Milvus, a SQL query is executed:

SELECT SUM(row_count) FROM TableFiles WHERE table_id = 'table_1' AND file_type IN (1, 2, 3);

It is easy to tell what this query does if you have some SQL background. In TableFiles, all values of the row_count field that match the condition are summed up to get the number of vectors in a table. The condition has two parts: the table name must be table_1, and the file state must be 1, 2, or 3. In other words, only raw vector files, files to build indexes for, and files with built indexes are counted. Files in state 4 (soft delete) or 7 (backup) are not counted.
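The effect of the file_type filter can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the TableFiles schema is reduced to three columns, and the rows are made-up sample data rather than real Milvus metadata.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real TableFiles schema.
conn.execute(
    "CREATE TABLE TableFiles (table_id TEXT, file_type INTEGER, row_count INTEGER)"
)
conn.executemany(
    "INSERT INTO TableFiles VALUES (?, ?, ?)",
    [
        ("table_1", 1, 470000),  # raw vector file: counted
        ("table_1", 3, 530000),  # file with built index: counted
        ("table_1", 7, 530000),  # backup: ignored
        ("table_1", 4, 10000),   # soft delete: ignored
        ("table_2", 1, 999),     # another table: ignored
    ],
)

(total,) = conn.execute(
    "SELECT SUM(row_count) FROM TableFiles "
    "WHERE table_id = ? AND file_type IN (1, 2, 3)",
    ("table_1",),
).fetchone()
print(total)  # 1000000
```

Without the file_type filter, the backup file would be counted alongside its index file and the vectors would be counted twice.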

Conduct vector similarity search

The client uses search to conduct vector similarity search:

milvus.search(table_name='table_1', query_records=query_vectors, top_k=100, nprobe=32)

In the Milvus vector database, a SQL query is executed to get the file that needs to be searched:

SELECT * FROM TableFiles WHERE table_id = 'table_1' AND file_type IN (1, 2, 3);

This query returns all the information about the files to be searched. The Milvus vector database only searches files in state 1, 2, or 3. Milvus then finds the file locations via file_id. Afterwards, the search scheduler loads the files one by one into CPU memory or GPU memory for further computation.

Build indexes

The client uses create_index to build indexes. The following code builds an SQ8 index with nlist as 5000:

milvus.create_index('table_1', {'index_type': IndexType.IVF_SQ8, 'nlist': 5000})

When you check the metadata, you can notice two changes. In Tables, the target index type and nlist are updated. In TableFiles, the raw data files are marked as files to build indexes for (file_type changes from 1 to 2) by the following operation:

UPDATE TableFiles SET file_type = 2 WHERE table_id = 'table_1' AND file_type = 1;

From this point, create_index waits until indexes are built for all files. Milvus continuously checks whether new raw data files are generated. If so, their file_type is set to 2 (to build indexes). The scheduler creates tasks for files with file_type 2 and builds the indexes one by one. create_index returns once indexes are built for all files.

When index building is complete for a file, a new index file is generated. The raw data file is marked as backup (file_type 7) so that Milvus can later switch to other index types.

We can see that two files are added. The row_count of each new file is the same as that of one of the previous raw data files. However, file_size is much smaller because SQ8 compresses the data. The engine_type and file_type fields tell the new index files apart from the raw data files.
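The state changes described above can be replayed against an in-memory SQLite database. This is a sketch under assumptions: the schema is reduced to four columns, the file_id values are invented, and the loop only imitates what the scheduler records in the metadata (no index is actually built).

```python
import sqlite3

# file_type values named in the text.
RAW, TO_INDEX, INDEX, BACKUP = 1, 2, 3, 7

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real TableFiles schema.
conn.execute(
    "CREATE TABLE TableFiles "
    "(file_id TEXT, table_id TEXT, file_type INTEGER, row_count INTEGER)"
)
conn.executemany(
    "INSERT INTO TableFiles VALUES (?, 'table_1', ?, ?)",
    [("raw_a", RAW, 530000), ("raw_b", RAW, 470000)],
)

# Step 1: mark raw data files as files to build indexes for.
conn.execute(
    "UPDATE TableFiles SET file_type = ? WHERE table_id = 'table_1' AND file_type = ?",
    (TO_INDEX, RAW),
)

# Step 2: for each marked file, register an index file with the same
# row_count and keep the raw data file as a backup.
pending = conn.execute(
    "SELECT file_id, row_count FROM TableFiles WHERE file_type = ?", (TO_INDEX,)
).fetchall()
for file_id, row_count in pending:
    conn.execute(
        "INSERT INTO TableFiles VALUES (?, 'table_1', ?, ?)",
        (file_id + "_index", INDEX, row_count),
    )
    conn.execute(
        "UPDATE TableFiles SET file_type = ? WHERE file_id = ?", (BACKUP, file_id)
    )

files = conn.execute(
    "SELECT file_id, file_type, row_count FROM TableFiles ORDER BY file_id"
).fetchall()
for row in files:
    print(row)
```

At the end, the table holds four entries: two backup files and two index files with matching row counts.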

Remove indexes

The client uses drop_index to remove indexes:

milvus.drop_index(table_name='table_1')

In the Milvus vector database, some operations are performed when removing indexes. First, the index type is switched to 1 (FLAT). Then, file_type of the index file is set to 4 (soft delete). Meanwhile, file_type of the backup file is set to 1 (raw data file):

UPDATE Tables SET engine_type = 1 WHERE table_id = 'table_1';
UPDATE TableFiles SET file_type = 4 WHERE table_id = 'table_1' AND file_type = 3;
UPDATE TableFiles SET file_type = 1 WHERE table_id = 'table_1' AND file_type = 7;

The thread responsible for cleaning data gets the information of the file to be removed and physically removes the file from the hard disk. Then, the entry of the index file is removed from TableFiles.

DELETE FROM TableFiles WHERE table_id = 'table_1' AND file_type = 4;
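The statements above can be run in sequence against a toy database to watch the entries change. In this sketch the schemas are reduced to a few columns, the file_id values are invented, and the starting engine_type of 5 is just a placeholder for some non-FLAT index type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the real schemas.
    CREATE TABLE Tables (table_id TEXT, engine_type INTEGER);
    CREATE TABLE TableFiles (file_id TEXT, table_id TEXT, file_type INTEGER);
    INSERT INTO Tables VALUES ('table_1', 5);               -- placeholder non-FLAT type
    INSERT INTO TableFiles VALUES ('index_a', 'table_1', 3);   -- index file
    INSERT INTO TableFiles VALUES ('backup_a', 'table_1', 7);  -- backup of raw data
""")

# drop_index: switch to FLAT, soft-delete index files, restore backups.
conn.executescript("""
    UPDATE Tables SET engine_type = 1 WHERE table_id = 'table_1';
    UPDATE TableFiles SET file_type = 4 WHERE table_id = 'table_1' AND file_type = 3;
    UPDATE TableFiles SET file_type = 1 WHERE table_id = 'table_1' AND file_type = 7;
""")
after_drop = conn.execute(
    "SELECT file_id, file_type FROM TableFiles ORDER BY file_id"
).fetchall()

# Cleanup: remove the entries of the soft-deleted files.
conn.execute("DELETE FROM TableFiles WHERE table_id = 'table_1' AND file_type = 4")
after_cleanup = conn.execute(
    "SELECT file_id, file_type FROM TableFiles ORDER BY file_id"
).fetchall()
(engine_type,) = conn.execute(
    "SELECT engine_type FROM Tables WHERE table_id = 'table_1'"
).fetchone()

print(after_drop)
print(after_cleanup)
print(engine_type)
```

Because the index file is only soft-deleted at first, searches can fall back to the restored raw data file right away while the physical removal happens later.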

Remove a table

The client uses drop_table to remove a table:

milvus.drop_table(table_name='table_1')

Milvus sets the state of the table to 1 (soft delete) and the file_type of all files in the table to 4 (soft delete). The soft-deleted entries are then removed from the metadata:

DELETE FROM TableFiles WHERE table_id = 'table_1' AND file_type = 4;
DELETE FROM Tables WHERE state = 1;
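The two phases of drop_table can be sketched with sqlite3 as follows. The two UPDATE statements are assumptions inferred from the description above, since the article only lists the DELETE statements; the schemas and the sample rows are likewise simplified and invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the real schemas.
    CREATE TABLE Tables (table_id TEXT, state INTEGER);
    CREATE TABLE TableFiles (file_id TEXT, table_id TEXT, file_type INTEGER);
    INSERT INTO Tables VALUES ('table_1', 0), ('table_2', 0);
    INSERT INTO TableFiles VALUES
        ('f1', 'table_1', 1), ('f2', 'table_1', 3), ('f3', 'table_2', 1);
""")

# Phase 1 (assumed statements): soft-delete the table and all of its files.
conn.execute("UPDATE Tables SET state = 1 WHERE table_id = 'table_1'")
conn.execute("UPDATE TableFiles SET file_type = 4 WHERE table_id = 'table_1'")

# Phase 2 (statements from the article): remove the soft-deleted entries.
conn.execute("DELETE FROM TableFiles WHERE table_id = 'table_1' AND file_type = 4")
conn.execute("DELETE FROM Tables WHERE state = 1")

remaining_tables = [r[0] for r in conn.execute("SELECT table_id FROM Tables")]
remaining_files = [r[0] for r in conn.execute("SELECT file_id FROM TableFiles")]
print(remaining_tables)  # ['table_2']
print(remaining_files)   # ['f3']
```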

Summary

We can see that the Milvus vector database uses metadata by changing the state of some entries and then performing the corresponding operations (building indexes, removing files). The implementation includes techniques such as avoiding certain problems through the transaction mechanism of an OLTP database and rolling back operations when errors occur in the process. The Milvus vector database also defines interfaces for metadata management, so apart from SQL databases, you can even use NoSQL databases for metadata management.
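As an illustration of the transaction and rollback idea, the sketch below wraps two related state changes in one SQLite transaction using Python's sqlite3 module. It does not show how Milvus implements this: the function drop_index_files, its fail_midway flag, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real TableFiles schema.
conn.execute(
    "CREATE TABLE TableFiles (file_id TEXT, table_id TEXT, file_type INTEGER)"
)
conn.executemany(
    "INSERT INTO TableFiles VALUES (?, 'table_1', ?)",
    [("index_a", 3), ("backup_a", 7)],
)
conn.commit()


def drop_index_files(conn, fail_midway=False):
    """Apply both state changes atomically: all of them or none of them."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE TableFiles SET file_type = 4 WHERE file_type = 3")
        if fail_midway:
            raise RuntimeError("simulated failure between the two updates")
        conn.execute("UPDATE TableFiles SET file_type = 1 WHERE file_type = 7")


def snapshot(conn):
    return conn.execute(
        "SELECT file_id, file_type FROM TableFiles ORDER BY file_id"
    ).fetchall()


try:
    drop_index_files(conn, fail_midway=True)
except RuntimeError:
    pass
state_after_failure = snapshot(conn)  # first update was rolled back

drop_index_files(conn)
state_after_success = snapshot(conn)

print(state_after_failure)
print(state_after_success)
```

Without the transaction, a failure between the two updates would leave the index file soft-deleted while the backup file is still not searchable.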
