
Vectorizing JSON Data with Milvus for Similarity Search

Zilliz
8 min read · May 22, 2024

JSON, short for JavaScript Object Notation, is a text-based format for data storage and exchange between servers and web applications. Due to its simplicity, flexibility, and compatibility, developers use JSON data across various industries and applications. For instance, IoT (Internet of Things) devices and sensors seamlessly communicate via JSON with web interfaces.

However, while JSON’s hierarchical structure is useful, it can be a pain to work with for storage, retrieval, and analytics. Vectorizing JSON transforms this data into a format optimized for efficient processing, storage, retrieval, and analysis, helping with performance and usability.

This article explores how the Milvus vector database streamlines JSON data vectorization, ingestion, and similarity retrieval. Furthermore, we offer a guide detailing the step-by-step process for vectorizing, ingesting, and retrieving JSON data using Milvus.

How Milvus Streamlines JSON Data Vectorization and Retrieval

Milvus is a highly scalable, open-source vector database that manages massive volumes of high-dimensional vector data. It benefits use cases such as retrieval augmented generation (RAG), semantic search, and recommender systems. Here’s how Milvus facilitates efficient JSON data processing and retrieval.

JSON data support with dynamic schema

Milvus supports seamless storage and querying of JSON data alongside vector data within users’ collections. With this capability, users can efficiently insert JSON data in bulk and perform advanced querying and filtering based on values in JSON fields. This capability is essential for detailed data analysis and manipulation in applications that require dynamic schema changes.
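To make this concrete, here is a plain-Python sketch (not the Milvus API) of what a JSON filter expression such as `metadata["views"] > 1000` selects from records with varying shapes; the sample records are hypothetical:

```python
# Plain-Python illustration of what a Milvus JSON filter expression like
# expr='metadata["views"] > 1000' would select. Sample records are made up;
# note that not every record carries the same keys (dynamic schema).
articles = [
    {"title": "A", "metadata": {"views": 5000, "tags": ["ml"]}},
    {"title": "B", "metadata": {"views": 300}},                 # no "tags" key
    {"title": "C", "metadata": {"views": 1200, "tags": ["iot"]}},
]

# Keep only the articles whose metadata.views exceeds 1000
matches = [a["title"] for a in articles if a["metadata"].get("views", 0) > 1000]
print(matches)  # ['A', 'C']
```

In Milvus itself, the equivalent filter string is passed to a query or search call rather than evaluated in Python.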

Integration with mainstream embedding models

Through PyMilvus, the Python SDK for Milvus, Milvus integrates with mainstream embedding models, including the OpenAI Embedding API, sentence-transformer, BM25, Splade, BGE-M3, and VoyageAI. This integration streamlines vector data preparation and reduces the complexity of the whole data pipeline without introducing additional data stacks.

| Embedding Function | Type | API or Open-sourced |
| --- | --- | --- |
| OpenAI | Dense | API |
| sentence-transformer | Dense | Open-sourced |
| BM25 | Sparse | Open-sourced |
| Splade | Sparse | Open-sourced |
| BGE-M3 | Hybrid | Open-sourced |
| VoyageAI | Dense | API |

How to Use Milvus for Embeddings Generation and Similarity Searches

In the following section, we’ll walk you through generating vector embeddings with Milvus by integrating with popular embedding models and conducting a similarity search across the JSON data. For the full code of this step-by-step guide, see the notebook here.

Step 1: Import the JSON data

import json

# Path to the JSON file
json_file_path = 'data.json'

# Load JSON data
with open(json_file_path, 'r') as file:
    articles = json.load(file)

# Display the data to ensure it's loaded correctly
print(articles)
  • Import JSON Library: This line imports Python’s built-in json library for working with JSON-formatted data.
  • Set JSON File Path: It sets the path to your JSON file, assuming it’s named ‘data.json’.
  • Load and Print JSON Data: Open the JSON file in read mode, load the data into the articles variable, and print it out to verify it’s loaded correctly. The with statement ensures the file is automatically closed after the data is read, making this approach both efficient and safe.

The data is in the following format:

{
  "title": "The Impact of Machine Learning in Modern Healthcare",
  "content": "Machine learning is revolutionizing the healthcare industry by improving diagnostics and patient care. Techniques such as predictive analytics are being used to forecast patient outcomes, enhance personalized treatment plans, and streamline operations.",
  "metadata": {
    "author": "Jane Doe",
    "views": 5000,
    "publication": "HealthTech Weekly",
    "claps": 150,
    "responses": 20,
    "tags": ["machine learning", "healthcare", "predictive analytics"],
    "reading_time": 8
  }
},
...
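Before vectorizing, it can help to confirm that every record actually carries the fields we plan to embed. This is a minimal sketch of such a check (an assumption on our part, not a required step in the guide); the sample records mirror the format above:

```python
# Minimal sketch: keep only records that have non-empty "title" and
# "content" fields, since those are the fields we will vectorize.
# This validation step is an assumption, not part of the Milvus workflow.
required = ("title", "content")

def valid_records(records):
    """Return only records containing non-empty title and content."""
    return [r for r in records if all(r.get(k) for k in required)]

articles = [
    {"title": "The Impact of Machine Learning in Modern Healthcare",
     "content": "Machine learning is revolutionizing healthcare...",
     "metadata": {"author": "Jane Doe", "views": 5000}},
    {"title": "", "content": "An empty title would produce a useless embedding."},
]

clean = valid_records(articles)
print(len(clean))  # 1
```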

Step 2: Generating Embeddings with PyMilvus

Milvus provides integrations with many popular embedding models through PyMilvus, streamlining the development process. In this example, we will use SentenceTransformerEmbeddingFunction, which wraps the all-MiniLM-L6-v2 sentence transformer model.

To use this embedding function, we must first install the PyMilvus client library with the model subpackage, which wraps all the utilities for vector generation.

pip install "pymilvus[model]"

from pymilvus.model.dense import SentenceTransformerEmbeddingFunction

# Get the model from the model library
model = SentenceTransformerEmbeddingFunction('all-MiniLM-L6-v2')

# Extract features from the preprocessed text
for article in articles:
    # Generate embeddings for the title and content
    article['title_vector'] = model([article['title']])[0]
    article['content_vector'] = model([article['content']])[0]

# Display the first article with added vector data
print({key: articles[0][key] for key in articles[0].keys()})

We will only focus on vectorizing “title” and “content” to keep things simple. Using the code above, we have vectorized two fields in our data and saved them as separate fields. Here’s a breakdown of the steps:

  • model = SentenceTransformerEmbeddingFunction('all-MiniLM-L6-v2'): This line loads the 'all-MiniLM-L6-v2' model, a pre-trained model, through the PyMilvus model library. This model generates dense vector embeddings for text inputs.
  • article['title_vector'] = model([article['title']])[0] and article['content_vector'] = model([article['content']])[0]: These lines apply the loaded Sentence Transformer model to encode the title and content of each article into vector embeddings. The model converts the textual information into a high-dimensional space where semantically similar text vectors are closer together. This transformation is crucial for many machine learning applications like semantic search, clustering, and information retrieval, which rely on understanding the underlying semantics of the text.
  • print({key: articles[0][key] for key in articles[0].keys()}): This final line prints all key-value pairs of the first article, including the newly added vectors for the title and content.
  • The output provides a straightforward visualization of how textual data is augmented with numerical representations, demonstrating the enriched data structure for data-driven applications.
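The claim that "semantically similar text vectors are closer" can be illustrated with toy vectors. The sketch below uses made-up 4-dimensional vectors (real all-MiniLM-L6-v2 embeddings are 384-dimensional) and cosine similarity, a common closeness measure:

```python
import numpy as np

# Toy illustration of "closer in vector space". The vectors and topic
# labels are invented for the example, not real model outputs.
def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ml_healthcare = [0.9, 0.8, 0.1, 0.0]   # "machine learning in healthcare"
ml_finance    = [0.8, 0.7, 0.2, 0.1]   # "machine learning in finance"
cooking       = [0.0, 0.1, 0.9, 0.8]   # "weeknight pasta recipes"

# Related topics score higher than unrelated ones
print(cosine(ml_healthcare, ml_finance) > cosine(ml_healthcare, cooking))  # True
```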

Step 3: Setting up Milvus

Now, we use the Milvus vector database to manage vector embeddings derived from textual data.

from pymilvus import connections, CollectionSchema, FieldSchema, DataType, Collection, utility

# Connect to Milvus
connections.connect(alias="default", host="localhost", port="19530")

# Define fields for the collection schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title_vector", dtype=DataType.FLOAT_VECTOR, dim=384),
    FieldSchema(name="content_vector", dtype=DataType.FLOAT_VECTOR, dim=384)
]

# Create a collection schema
schema = CollectionSchema(fields, description="Article Embeddings Collection")

# Create the collection in Milvus
collection_name = "articles"
if not utility.has_collection(collection_name):
    collection = Collection(name=collection_name, schema=schema)
    print(f"Collection '{collection_name}' created.")
else:
    collection = Collection(name=collection_name)
    print(f"Collection '{collection_name}' already exists.")

Let’s dive deeper:

  • connections.connect(alias="default", host="localhost", port="19530"): This line establishes a connection to a Milvus server running on the local machine (localhost) at port 19530. The alias="default" parameter means this connection is the default connection in subsequent operations. This step establishes communication between the application and the vector database, allowing for operations like data insertion, querying, and management.
  • FieldSchema and CollectionSchema: The fields for storing data in Milvus are defined using FieldSchema. Each field has a specific role:
  • id: An integer field configured as the primary key and set to automatically assign unique identifiers to each entry.
  • title_vector and content_vector: These fields store floating-point vectors (FLOAT_VECTOR), representing article title and content embeddings, respectively. The dim=384 specifies these vectors' dimensionality, matching the embedding model's output size.
  • These field definitions are grouped into a CollectionSchema, which essentially describes the structure and types of data the collection will hold, including a human-readable description.
  • collection_name = "articles": This sets a name for the collection.
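Because the schema fixes dim=384, every embedding we insert must have exactly that length. A minimal pre-insert sanity check (our own assumption, not a required Milvus step) looks like this:

```python
# Sanity check (an assumption, not required by Milvus): verify each
# embedding's length matches the schema's dim=384 before inserting,
# since a dimension mismatch causes insert errors.
EXPECTED_DIM = 384

def check_dim(vector, expected=EXPECTED_DIM):
    if len(vector) != expected:
        raise ValueError(f"Expected {expected}-d vector, got {len(vector)}-d")
    return True

fake_embedding = [0.0] * 384   # stand-in for a real model output
print(check_dim(fake_embedding))  # True
```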

Step 4: Inserting the Data

# Prepare the data for insertion
entities = [{
    "title_vector": article['title_vector'],
    "content_vector": article['content_vector']
} for article in articles]

# Insert the data into the collection
insert_result = collection.insert(entities)
print(f"Data inserted, number of rows: {len(insert_result.primary_keys)}")
  • The list comprehension entities = [{"title_vector": article['title_vector'], "content_vector": article['content_vector']} for article in articles] constructs a list of dictionaries, where each dictionary represents an entity to be inserted into the Milvus collection. These vectors are prepared in a format that matches the fields defined in the Milvus collection schema (title_vector and content_vector).
  • The insert_result = collection.insert(entities) line inserts the prepared entities into the collection. This operation is crucial for populating the Milvus database with vector data for various retrieval tasks, such as similarity searches or machine learning model inputs.
  • print(f"Data inserted, number of rows: {len(insert_result.primary_keys)}"): After the data insertion, this line prints out the number of successful insertion rows (entities). The insert_result.primary_keys provides the unique identifiers for each inserted record, reflecting how many entries have been added. This feedback is important for verifying that the data has been correctly and fully stored in the collection.
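For large datasets, a common refinement is to insert in batches rather than all at once. This is a sketch of such chunking in plain Python (the batch size of 1000 is an arbitrary choice, not a Milvus requirement):

```python
# Sketch of batched insertion for large datasets. The batch size is an
# arbitrary example value; Milvus accepts bulk inserts, and chunking
# simply keeps each request to a manageable size.
def chunks(items, size):
    """Yield successive slices of items, each at most `size` long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical prepared entities (stand-ins for real embeddings)
entities = [{"title_vector": [0.0] * 384, "content_vector": [0.0] * 384}
            for _ in range(2500)]

batches = list(chunks(entities, 1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
# Each batch would then be passed to collection.insert(batch)
```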

Step 5: Creating Indexes

Indexing is critical in any database management system as it directly affects the performance of search queries. For AI applications, where real-time analysis and responsiveness are crucial, efficient indexing can significantly enhance the user experience.

# Define index parameters for title_vector and content_vector
index_params = {
    "index_type": "IVF_FLAT",  # Inverted File index, suitable for L2 and IP distances
    "metric_type": "L2",       # Euclidean distance
    "params": {"nlist": 100}   # Number of clusters
}

# Create an index on the 'title_vector' field
collection.create_index(field_name="title_vector", index_params=index_params)
print("Index created on 'title_vector'.")

# Create an index on the 'content_vector' field
collection.create_index(field_name="content_vector", index_params=index_params)
print("Index created on 'content_vector'.")

# Load the collection into memory for searching
collection.load()
print("Collection loaded into memory.")

Using the IVF_FLAT index type, we partition the vector space into 100 clusters (nlist=100) and use the L2 metric for Euclidean distance calculations. By clustering the vector space, IVF_FLAT reduces the search space for each query, drastically improving search speeds, especially as the dataset grows. We create indexes on both the title_vector and content_vector fields to speed up retrieval tasks, with confirmatory feedback printed after each creation to ensure successful setup. Finally, we load the collection into memory, which is required before searching.
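The intuition behind IVF can be sketched in a few lines of NumPy. This is a conceptual illustration with toy data, not Milvus internals: vectors are grouped by nearest centroid, and a query scans only the nprobe closest clusters instead of the whole collection:

```python
import numpy as np

# Conceptual sketch of IVF-style search (toy data, not Milvus internals).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8))   # toy 8-d "collection"
centroids = rng.normal(size=(10, 8))   # nlist=10 toy cluster centers

# Assign each vector to its nearest centroid (the "inverted file")
assign = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)

query = rng.normal(size=8)
nprobe = 2
# Scan only the nprobe clusters whose centroids are closest to the query
probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
candidates = np.flatnonzero(np.isin(assign, probe))
print(len(candidates) < len(vectors))  # True: the search space shrinks
```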

Step 6: Performing Similarity Search

# Define search parameters
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10}
}

# Assume using the first article's content vector for the query
query_vector = [entities[0]["content_vector"]]

# Execute the search on the 'content_vector' field
search_results = collection.search(
    data=query_vector,
    anns_field="content_vector",
    param=search_params,
    limit=5,
    output_fields=["id"]
)

# Print search results
for hits in search_results:
    for hit in hits:
        print(f"Hit: {hit}, ID: {hit.id}")

This process begins by defining search parameters in the search_params dictionary, utilizing the L2 metric to measure Euclidean distances between vectors, with nprobe set to 10 to ensure a balance between search speed and accuracy. A query vector is prepared using the first article's content_vector, which is aimed at finding articles with similar content within the collection. The search uses these parameters on the content_vector field, limiting results to the top 5 closest matches and returning their IDs.

Finally, the search results are iterated by printing each match with its ID to demonstrate the query’s effectiveness by identifying the articles most similar to the initial query based on their vectorized content representations.
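Conceptually, this search is a nearest-neighbor lookup under L2 distance. The brute-force equivalent can be written in a few lines of NumPy with toy vectors (Milvus computes this approximately and at scale via the IVF index):

```python
import numpy as np

# Brute-force mirror of the top-5 L2 search, on toy vectors. The 8-d
# vectors are stand-ins for real 384-d content embeddings.
rng = np.random.default_rng(1)
content_vectors = rng.normal(size=(100, 8))
query = content_vectors[0]   # "the first article's content vector"

# Euclidean distance from the query to every stored vector
dists = np.linalg.norm(content_vectors - query, axis=1)
top5 = np.argsort(dists)[:5]
print(top5[0])  # 0 -- the query vector is its own nearest neighbor
```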

Conclusion

Vector databases optimize vectorizing and querying JSON data, enabling similarity searches, complex pattern recognition, and relational queries beyond the constraints of traditional databases. Milvus vector database supports JSON data storage and retrieval alongside vector data and integrates with mainstream embedding models for vector generation. This integration helps developers easily vectorize their JSON data without adding additional data stacks, significantly streamlining the development process.

Additionally, sparsity occurs in JSON data when not all keys are included in every document, leading to vectors with many zero values indicating missing or null information. This creates a complex high-dimensional vector space, making processing and querying challenging. Milvus’s integration with embedding models optimizes data structures and algorithms to efficiently manage sparse, high-dimensional data. For example, it employs dynamic schemas capable of adapting to diverse data sizes and types, enabling effective storage and querying of intricate JSON structures with minimal preprocessing requirements.
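One common way to handle such sparsity is to store only the non-zero entries as an index-to-value map rather than a mostly-zero dense list, which is how sparse embeddings (such as BM25 or Splade outputs) are typically represented. A minimal sketch with made-up values:

```python
# Sketch of a sparse representation: keep only non-zero entries as an
# index -> value map instead of a mostly-zero dense list. Values are
# invented for illustration.
dense = [0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0, 1.2, 0.0, 0.0]

sparse = {i: v for i, v in enumerate(dense) if v != 0.0}
print(sparse)                    # {2: 0.7, 7: 1.2}
print(len(sparse), len(dense))   # 2 10
```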
