Understanding Milvus: Key Concepts and Potential Applications

Berktorun
n11 Tech


Milvus is an advanced open-source vector database designed for handling complex data operations, particularly those involving unstructured data. This article delves into the underlying architecture of Milvus, exploring its key components such as the Access Layer, Coordinator Service, Worker Nodes, and Storage. We will also examine the algorithms Milvus employs, including ANNOY and HNSW.

In the context of n11, we will discuss how Milvus can be utilized for specific scenarios, such as category prediction. By leveraging Milvus’s robust capabilities, n11 can enhance its data processing and analysis strategies, driving more accurate and scalable solutions.

Key Concepts

In this section, we’ll explore the foundational ideas behind Milvus, including the nature of unstructured data, the role of embedding vectors in representing this data, and how vector similarity search is used to find and analyze similar data points.

Unstructured Data

Unstructured data refers to information that doesn’t fit into a traditional, organized database structure. This includes formats like images, videos, audio files, and natural language text, which lack a predefined model or schema. Unstructured data is highly prevalent, making up approximately 80% of the world’s data. To make sense of this data, it can be transformed into vectors using artificial intelligence (AI) and machine learning (ML) models, enabling more advanced analysis and search capabilities.

Embedding Vectors

An embedding vector is a numerical representation of unstructured data, such as emails, IoT sensor data, images, or even protein structures. Essentially, it’s an array of floating-point numbers or binary values that captures the features of this data. By converting unstructured data into embedding vectors, modern techniques make it easier to analyze and compare complex data types in a structured manner.

Vector Similarity Search

Vector similarity search involves comparing a vector to a database of vectors to identify those most similar to the query vector. This process is accelerated by using Approximate Nearest Neighbor (ANN) search algorithms. When two embedding vectors are found to be very similar, it indicates that the original data sources they represent are also closely related.
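
To make the idea concrete, the sketch below performs an exact, brute-force top-k search in Python with NumPy; this is the computation that ANN algorithms approximate at a fraction of the cost. The vectors and the query are random placeholders.

import numpy as np

rng = np.random.default_rng(42)
database = rng.normal(size=(10_000, 128))   # 10,000 vectors with 128 dimensions
query = rng.normal(size=128)

# Cosine similarity between the query and every vector in the database
db_norms = np.linalg.norm(database, axis=1)
scores = database @ query / (db_norms * np.linalg.norm(query))

top_k = np.argsort(-scores)[:5]             # indices of the 5 most similar vectors
print(top_k, scores[top_k])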

Why Milvus

Milvus stands out for its high performance, particularly when handling vector searches across massive datasets, ensuring quick and efficient retrieval of relevant data. It’s designed with developers in mind, offering extensive support for multiple programming languages and a robust toolchain that simplifies integration and usage. Milvus is also built for scalability in cloud environments, maintaining high reliability even during system disruptions. Additionally, it excels in hybrid search capabilities by seamlessly combining scalar filtering with vector similarity search, allowing for more precise and versatile data queries.

To achieve this level of precision, Milvus supports various similarity metrics that cater to different types of data and search requirements.

Similarity Metrics

In Milvus, similarity metrics are essential for measuring how closely vectors match each other. By selecting the appropriate metric, you can greatly improve the accuracy of tasks like classification and clustering. Milvus supports a variety of metrics tailored to different types of data. For floating-point embeddings, commonly used metrics include Euclidean distance, Inner product, and Cosine similarity. For binary embeddings, Hamming and Jaccard distances are frequently utilized. Each of these metrics plays a vital role in optimizing vector search performance, ensuring that similar data points are effectively and efficiently identified.
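
As a quick illustration, the metrics above can be computed directly with NumPy; the vectors below are toy examples rather than real embeddings.

import numpy as np

a = np.array([0.1, 0.9, 0.3])
b = np.array([0.2, 0.8, 0.5])

euclidean = np.linalg.norm(a - b)                                 # straight-line (L2) distance
inner_product = float(np.dot(a, b))                               # larger means more similar
cosine = inner_product / (np.linalg.norm(a) * np.linalg.norm(b))

# Binary embeddings use set-style metrics
x = np.array([1, 0, 1, 1, 0], dtype=bool)
y = np.array([1, 1, 1, 0, 0], dtype=bool)
hamming = int(np.count_nonzero(x != y))                           # number of differing bits
jaccard = 1 - np.count_nonzero(x & y) / np.count_nonzero(x | y)   # Jaccard distance

print(euclidean, inner_product, cosine, hamming, jaccard)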

Milvus Architecture

Milvus is designed for similarity search on large-scale dense vector datasets, handling millions to trillions of vectors efficiently. It supports advanced features like data sharding, streaming data ingestion, dynamic schema, and hybrid search, making it highly adaptable to various embedding retrieval scenarios. Milvus adopts a shared-storage architecture with separate layers for access, coordination, computation, and storage, ensuring scalability, availability, and disaster recovery. For optimal performance, deploying Milvus with Kubernetes is recommended.

Milvus Architecture Diagram

Milvus follows a disaggregated architecture, with four independent layers that ensure scalability and disaster recovery.

Access Layer

The access layer is the front-end of the system, composed of a group of stateless proxies that serve as the main user interface. It validates client requests and optimizes the results returned to users:

  • The proxy is stateless and provides a unified service address through load balancing tools like Nginx, Kubernetes Ingress, NodePort, and LVS.
  • In Milvus’s massively parallel processing (MPP) architecture, the proxy aggregates and post-processes intermediate results before delivering the final output to the client.

Coordinator Service

The coordinator service functions as the system’s brain, assigning tasks to worker nodes and managing cluster operations. It handles cluster topology, load balancing, timestamp generation, and data management.

  • Root Coordinator (Root Coord): Manages data definition (DDL) and data control (DCL) requests, such as creating or deleting collections, partitions, or indexes, and oversees timestamp management.
  • Query Coordinator (Query Coord): Manages query node topology, load balancing, and segment handoffs from growing to sealed segments.
  • Data Coordinator (Data Coord): Oversees data and index node topology, maintains metadata, and triggers background operations like flushing, compaction, and index building.

Worker Nodes

Worker nodes execute tasks as directed by the coordinator service, handling data manipulation language (DML) commands from the proxy. These nodes are stateless, thanks to the separation of storage and computation, which allows for easy scaling and disaster recovery when deployed on Kubernetes. There are three types of worker nodes:

  • Query Node: Retrieves incremental log data, converts it into growing segments, loads historical data from object storage, and performs hybrid searches between vector and scalar data.
  • Data Node: Processes mutation requests, retrieves log data, and packages it into log snapshots for storage in the object storage.
  • Index Node: Builds indexes, which can be managed using a serverless framework without needing to be memory resident.

Storage

Storage is the backbone of the system, ensuring data persistence. It consists of three main components:

  • Meta Storage: Stores snapshots of metadata, such as collection schemas and message checkpoints, using etcd for high availability and strong consistency. Milvus also relies on etcd for service registration and health checks.
  • Object Storage: Handles the storage of log snapshots, index files, and intermediate query results. Milvus uses MinIO by default and can also be configured to use AWS S3 or Azure Blob Storage. To improve performance and reduce costs, Milvus plans to implement cold-hot data separation using memory- or SSD-based caching.
  • Log Broker: Acts as a pub-sub system for streaming data persistence and event notifications. It ensures data integrity during system recovery, using Pulsar for clusters and RocksDB for standalone setups. The log broker’s pub-sub mechanism, as illustrated, supports system scalability and reliability by maintaining a log sequence that subscribers use to update local data.
Log mechanism

Milvus operates in two modes: Standalone and Cluster. Both modes offer the same core features, allowing you to choose the one that best suits your dataset size, traffic, and operational needs. The Standalone mode is ideal for smaller-scale applications and includes core components like Milvus, Meta Store, and Object Storage, all working together to ensure efficient data management and persistence. On the other hand, the Cluster mode is designed for larger, more complex deployments, featuring a microservice architecture with components such as Root Coord, Proxy, and Query Node, along with third-party dependencies like etcd for metadata storage, S3 for object storage, and Pulsar or Kafka for log management. While Standalone mode is suitable for simpler setups, Cluster mode is preferred for scalable, distributed environments that require high availability and resilience.

Standalone and Cluster modes
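
Regardless of the mode, clients reach Milvus through the same proxy endpoint, so application code stays identical. Below is a minimal connection sketch with pymilvus; the host and port are placeholders that assume a default local deployment.

from pymilvus import connections, utility

# Standalone and Cluster both expose the gRPC endpoint through the proxy,
# so switching modes does not change client code.
connections.connect(alias="default", host="localhost", port="19530")

print(utility.list_collections())   # quick sanity check that the connection works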

Message Storage / WAL

Milvus employs a Write-Ahead Logging (WAL) mechanism to ensure the secure processing of data. This mechanism records every operation before it is executed, allowing the system to recover by replaying these logs in the event of unexpected failures. For this critical function, Milvus supports two powerful messaging systems: Apache Kafka and Apache Pulsar.

  • Apache Kafka: Apache Kafka is a widely used distributed streaming platform known for its high throughput, fault tolerance, and scalability. Kafka operates as a publish-subscribe messaging system, where data is written to a log and then consumed by subscribers. In the context of Milvus, Kafka ensures the persistence of streaming data, preventing data loss in the event of a system failure. Kafka’s straightforward and direct architecture makes it a reliable choice for scenarios where message ordering and durability are crucial. Kafka’s advantages include its ability to handle high throughput and large-scale data processing. When working with massive datasets, Kafka enables rapid data processing. However, Kafka’s scalability is typically achieved by adding more brokers, which can complicate management. Additionally, Kafka is primarily a log-based storage system, which limits its use cases mainly to streaming data.
  • Apache Pulsar: Apache Pulsar, on the other hand, offers a more flexible and powerful solution. Pulsar is a distributed messaging and streaming platform designed for high-performance workloads. One of its standout features is its multi-layered architecture, which separates the serving and storage layers. This design not only enhances scalability but also simplifies cluster management. Pulsar supports dynamic scaling, allowing the system to adapt to changing workloads seamlessly. Pulsar’s scalability is further enhanced by its integration with Apache BookKeeper, a distributed log storage system that ensures data is stored persistently and reliably. This is particularly important when dealing with large datasets and scenarios that require high fault tolerance. Moreover, Pulsar can function both as a message queue and a message stream, making it a versatile option that supports a broader range of messaging patterns.

Milvus supports both Kafka and Pulsar, offering flexibility in choosing the messaging system that best fits operational needs. Kafka is ideal for scenarios prioritizing high throughput and data durability. In contrast, Pulsar stands out with its fine-grained scalability, simplified management, and versatility, making it particularly advantageous for large-scale deployments.

Pulsar’s integration with Apache BookKeeper and its scalable architecture ensures that Milvus can maintain high availability and reliability, even as data volume and query load increase. This flexibility is especially beneficial in enterprise environments like n11, where complex and evolving data requirements must be met efficiently.

In summary, while Kafka and Pulsar are both robust options for managing WAL in Milvus, Pulsar’s advanced features and flexibility make it a compelling choice for optimizing Milvus for diverse and future-proof applications.

Now that we’ve thoroughly explored the architecture and the integral components that form the backbone of Milvus, let’s shift our focus to the algorithms that power its performance. Understanding these algorithms is key to grasping how Milvus efficiently handles the complexity of large-scale, unstructured datasets.

Algorithms

ANNOY!🥱

ANNOY (Approximate Nearest Neighbors Oh Yeah!) is an algorithm designed for efficient nearest neighbor search in high-dimensional spaces. It’s particularly useful when dealing with large datasets where an exact search would be computationally expensive. ANNOY works by building multiple trees, each one splitting the data in a different way, to approximate the nearest neighbors of a given point.

When a query is made, ANNOY traverses these trees to quickly find candidate neighbors. While it doesn’t always guarantee the exact nearest neighbors, it achieves a balance between speed and accuracy, which is critical in large-scale applications. ANNOY can work with various distance metrics, such as Euclidean distance, to determine how close two vectors are in the high-dimensional space. Euclidean distance, for example, measures the straight-line distance between points and is commonly used in tasks that require assessing the similarity between data points.
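
The open-source annoy library exposes this build-then-query flow directly. The sketch below is purely illustrative, uses random data, and shows the standalone library rather than Milvus’s internal implementation.

import random
from annoy import AnnoyIndex

dim = 40
index = AnnoyIndex(dim, "euclidean")        # Euclidean distance, as discussed above

for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)                             # 10 trees: more trees -> better accuracy, larger index

query = [random.gauss(0, 1) for _ in range(dim)]
neighbors = index.get_nns_by_vector(query, 5, include_distances=True)
print(neighbors)                            # approximate 5 nearest neighbors and their distances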

ANNOY! Algorithm Flow

The flow diagram above illustrates a basic approach that mirrors the steps ANNOY takes to find the nearest neighbors:

Starting Point: Similar to how ANNOY begins with a set of data points, the process starts by selecting an initial point to begin the search.

Tree Traversal: The diagram shows how the algorithm systematically searches for the nearest unvisited points, akin to how ANNOY navigates through its trees to identify potential neighbors.

Approximation: Just as ANNOY may not visit every single point but instead focuses on approximating the nearest neighbors quickly, the diagram reflects this by prioritizing certain points while marking others as visited.

By combining the principles shown in the diagram with a distance metric like Euclidean distance, ANNOY is able to efficiently perform approximate nearest neighbor searches, making it a powerful tool in scenarios where performance is critical, such as real-time recommendation systems or large-scale image retrieval in Milvus.

HNSW🌐

HNSW (Hierarchical Navigable Small World) is an advanced algorithm for approximate nearest neighbor search, similar to ANNOY but with key differences. While ANNOY uses random trees to partition data, HNSW builds a multi-layer graph where higher layers provide a coarse overview and lower layers offer finer detail. This hierarchical structure allows HNSW to navigate the search space more intelligently, often resulting in faster and more accurate searches, especially in high-dimensional or complex data distributions.

  • Multi-Layer Graph Structure: HNSW organizes data into multiple layers, where each layer is a graph connecting points that are close to each other according to some distance metric, such as Euclidean distance. The higher layers have fewer points and provide a broad, less detailed view, whereas the lower layers have more points and provide detailed connections.
  • Search Process: The search in HNSW starts at the topmost layer, where the algorithm quickly identifies a rough approximation of the nearest neighbors. It then navigates down through the layers, refining the search with each step, until it reaches the bottom layer, where it performs a more precise search to identify the nearest neighbors.
  • Efficiency: The hierarchical structure allows HNSW to efficiently narrow down the search space by progressively refining the search from a broad overview to a detailed examination. This makes it highly efficient for large-scale datasets where exhaustive search would be computationally prohibitive.
Layers in the HNSW Algorithm

The diagram above visually represents the layers in the HNSW algorithm. Here’s how it relates to the HNSW process:

  • Layer 2 (Top Layer): This is the highest layer with the fewest points, providing a broad overview of the data space. The algorithm begins the search here, identifying the nearest neighbor in this coarse layer. The dashed lines represent the connections between layers, indicating the transition from a broad search to a more detailed one.
  • Layer 1 (Middle Layer): After identifying the nearest neighbor in the top layer, the algorithm moves to the next layer down. Here, it refines the search by finding the nearest neighbor within a more populated set of points. The process repeats as the algorithm navigates down through the layers.
  • Layer 0 (Bottom Layer): This is the most detailed layer, where the search is finally narrowed down to find the exact nearest neighbors. The connections in this layer are dense, allowing for a more precise search that considers all possible candidates.
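
For comparison, here is a compact sketch of the same search flow using the standalone hnswlib library; the data is random and the parameters (M, ef_construction, ef) are illustrative values, not anything Milvus uses internally.

import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)            # Euclidean (L2) space
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)                                       # higher ef -> better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)     # approximate 5 nearest neighbors
print(labels, distances)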

Having explored the fundamental concepts, architecture, and key algorithms that make Milvus a powerful vector database, let’s now turn our attention to how these elements come together in a real-world application. Specifically, we’ll delve into how n11 leverages Milvus for category prediction, ensuring that products are accurately classified based on both their titles and images. We’ll walk through a detailed scenario that illustrates how Milvus processes data, applies advanced algorithms like ANNOY and HNSW, and ultimately helps n11 maintain a high standard of accuracy and consistency in its marketplace.

Using Milvus for Category Prediction in n11: A Detailed Walkthrough

In the n11 platform, category prediction is essential to ensure that products are correctly classified, which significantly impacts the user experience and the accuracy of search results. By leveraging Milvus and its robust vector search capabilities, n11 can automate and enhance this process, particularly in scenarios where product titles and images may not align perfectly. Below, we’ll explore how Milvus processes the data, from ingestion to final categorization, using advanced algorithms like ANNOY and HNSW.

1. Data Ingestion and Vectorization

Text Data (Product Title)

  • When a seller inputs a product title, the text undergoes preprocessing, including tokenization, stop-word removal, and possibly stemming or lemmatization to distill the text to its core components.
  • This processed text is then converted into a high-dimensional vector using an embedding model such as Word2Vec or BERT. These models map the text into a vector space where semantically similar phrases are positioned closely.
  • For instance, a title like “Sunglasses for Women” might be transformed into a 300-dimensional vector, with each dimension representing a specific feature learned from the model’s training data.
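
As an illustration of this step, the sketch below embeds a product title with the sentence-transformers library; the model name is an assumption chosen for brevity, and the output dimensionality depends on the model (384 here, rather than the 300 used in the example above).

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative model choice
title_vector = model.encode("Sunglasses for Women")

print(title_vector.shape)                              # (384,) for this particular model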

Image Data (Product Images)

  • Uploaded product images are processed using a convolutional neural network (CNN), such as ResNet or VGG, which extracts visual features from the images and generates embedding vectors.
  • These vectors represent the image in a high-dimensional space, capturing visual aspects like shapes, colors, and textures. Similar images will thus be located near each other in this vector space.
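
A hedged sketch of this step with torchvision is shown below; the ResNet-50 backbone, preprocessing pipeline, and file path are illustrative choices rather than a description of the production setup.

import torch
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()        # drop the classification head, keep the 2048-d features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("product_image.jpg").convert("RGB")   # hypothetical file name
with torch.no_grad():
    image_vector = resnet(preprocess(image).unsqueeze(0)).squeeze(0)

print(image_vector.shape)                                 # torch.Size([2048])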

2. Attribute Assignment and Metadata

After vectorization, Milvus assigns attributes to each vector:

  • Vector ID: A unique identifier assigned to each vector for retrieval purposes.
  • Category Labels: Initial labels provided by the seller, such as “Sunglasses,” used as metadata.
  • Timestamp: The time the vector was created, useful for tracking and data management.
  • Additional Metadata: Other relevant information, like seller ID, product ID, or tags, to facilitate filtering or further categorization.

Here’s an example JSON structure that represents the attribute assignment and metadata for a vector in Milvus:

{
  "vector_id": "1234567890abcdef",
  "attributes": {
    "category_label": "Sunglasses",
    "timestamp": "2024-09-01T10:15:30Z",
    "metadata": {
      "seller_id": "seller_789",
      "product_id": "product_12345",
      "tags": ["fashion", "accessory", "summer"]
    }
  },
  "vector_data": [
    0.132, 0.567, -0.234, 0.789, -0.456, 0.101, 0.923, 0.654, -0.214, 0.783
    // Additional dimensions of the vector...
  ]
}
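
For reference, a record like the one above could be written to Milvus with pymilvus roughly as follows; the collection name, field names, and 300-dimensional vector are illustrative assumptions, not n11’s production schema.

from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="vector_id", dtype=DataType.VARCHAR, max_length=64, is_primary=True),
    FieldSchema(name="category_label", dtype=DataType.VARCHAR, max_length=128),
    FieldSchema(name="product_id", dtype=DataType.VARCHAR, max_length=64),
    FieldSchema(name="vector_data", dtype=DataType.FLOAT_VECTOR, dim=300),
]
schema = CollectionSchema(fields, description="Product title embeddings")
collection = Collection(name="product_titles", schema=schema)

# Column-based insert: one list per field, in schema order
collection.insert([
    ["1234567890abcdef"],                         # vector_id
    ["Sunglasses"],                               # category_label
    ["product_12345"],                            # product_id
    [[0.132, 0.567, -0.234] + [0.0] * 297],       # 300-dim vector (padded placeholder)
])
collection.flush()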

3. Data Storage in Milvus

After vectorization and attribute assignment, the vectors are ingested into Milvus, where they are indexed for fast retrieval. The process involves several steps within Milvus’s architecture:

Access Layer

  • The data first passes through the Access Layer, comprising stateless proxies that handle client requests. The proxies validate the data, ensuring it meets format and attribute requirements.
  • The Access Layer also balances the load across multiple proxies, ensuring efficient processing even with large volumes of data.

Coordinator Service

  • After validation, the data is managed by the Coordinator Service, which acts as the system’s brain, distributing data across worker nodes.
  • The Root Coordinator manages metadata and assigns data to appropriate partitions, triggering necessary background tasks like index building.

Worker Nodes

These nodes are responsible for the actual data processing:

  • Data Nodes: Handle the ingestion of vectors, ensuring they are stored correctly and creating log snapshots for persistence.
  • Index Nodes: Index the data using algorithms like ANNOY or HNSW, as the sketch after this list illustrates:

ANNOY creates random projection trees that partition the vector space, facilitating fast approximate nearest neighbor searches.

HNSW builds a hierarchical graph, where each vector connects to its nearest neighbors across multiple layers, allowing more accurate searches by refining results through the graph layers.

  • Query Nodes: Manage search queries, retrieving indexed vectors and performing similarity searches to find the closest matches.
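
As a concrete example of the index-building step, the sketch below creates an HNSW index on the vector field with pymilvus; the parameter values are illustrative, and HNSW is shown simply because it is the more widely available of the two index types in current Milvus releases.

from pymilvus import Collection

collection = Collection("product_titles")   # the collection from the earlier sketch

collection.create_index(
    field_name="vector_data",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},   # illustrative graph parameters
    },
)
collection.load()   # load the indexed segments so query nodes can serve searches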

Final Storage

Indexed vectors are stored in Milvus’s Object Storage, backed by systems like MinIO, AWS S3, or Azure Blob Storage, ensuring data persistence. Metadata management is handled by the Meta Storage, using etcd for high availability and consistency.

4. Data Retrieval and Similarity Search

When a new product is listed, Milvus processes the query through the following steps:

  • Query Processing: The query is received by the Access Layer and forwarded to a Query Node.
  • Index Traversal: The Query Node uses the indexes built by ANNOY or HNSW to navigate the vector space and identify nearest neighbors:

ANNOY provides fast approximate results using its random trees.

HNSW refines the search through hierarchical layers, offering more accurate results.

  • Result Compilation: The closest matches are compiled and returned for further processing.
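
Putting these retrieval steps together, a similarity search with an optional scalar filter might look like the following pymilvus sketch; the query vector, filter expression, and search parameters are illustrative, and the collection is assumed to be indexed and loaded as in the earlier sketches.

from pymilvus import Collection

collection = Collection("product_titles")
query_vector = [0.1] * 300                  # embedding of the new product title (placeholder)

results = collection.search(
    data=[query_vector],
    anns_field="vector_data",
    param={"metric_type": "L2", "params": {"ef": 64}},
    limit=5,
    expr='category_label == "Sunglasses"',  # hybrid search: scalar filter plus vector search
    output_fields=["category_label", "product_id"],
)

for hit in results[0]:
    print(hit.id, hit.distance, hit.entity.get("category_label"))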

5. Final Categorization and Action

Cross-Validation: If the image vectors and title vector suggest different categories (e.g., the title suggests “Sunglasses,” but the image resembles “Women’s Clothing”), the system flags this as a mismatch. The seller may be prompted to correct the images or title.

Blocking Nonexistent Products: For titles indicating nonexistent products (e.g., “iPhone 16”), Milvus can block the listing if no similar products exist in the database or if the title refers to a product not yet available.

Final Decision: Based on the similarity search and cross-validation, the product is either categorized correctly, flagged for review, or blocked if it doesn’t meet the platform’s standards.
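
The decision logic itself can be summarized with a small, purely hypothetical sketch; the function name, threshold, and return values are invented for illustration and do not reflect n11’s actual rules.

def decide_action(title_category: str, image_category: str, best_score: float,
                  min_score: float = 0.5) -> str:
    # No sufficiently similar product found (e.g. a nonexistent "iPhone 16"): block the listing.
    if best_score < min_score:
        return "block"
    # Title and images point to different categories: flag for seller review.
    if title_category != image_category:
        return "flag_for_review"
    # Otherwise the prediction is consistent: categorize the product.
    return "categorize"

print(decide_action("Sunglasses", "Women's Clothing", best_score=0.82))   # flag_for_review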

Conclusion

Milvus, with its advanced data processing and indexing capabilities, empowers n11 to effectively handle large volumes of unstructured data. By leveraging powerful algorithms like ANNOY and HNSW, Milvus ensures that similarity searches are not only fast but also highly accurate, enabling precise categorization of products. This precision is crucial for n11, as it directly contributes to improving the accuracy of product listings, minimizing the risk of misclassification, and maintaining the overall integrity of the marketplace.

Incorporating Milvus into our processes allows us to stay ahead in managing complex data challenges, ensuring that our platform continues to offer a reliable and seamless experience for both sellers and buyers. As we continue to refine and expand our use of Milvus, we anticipate even greater efficiency and accuracy in how we categorize and present products, ultimately enhancing the quality and trustworthiness of the n11 marketplace.

Thank you for taking the time to explore how we are utilizing Milvus to optimize our operations at n11. We’re excited to continue innovating and providing the best possible experience for our users, and we look forward to sharing more of our journey with you in the future.
