Why Should Vectors Be Stored Together with a Knowledge Graph?

Fanghua (Joshua) Yu
8 min read · Oct 11, 2023

An architecture review of Neo4j as both vector and graph store

The dome of Melbourne Central that covers Coop’s Shot Tower. Photograph by the author.

Abstract

Choosing a storage technology for vectors, i.e. embeddings of text, images and video produced by machine learning models, should take architectural principles into consideration, so that solutions built on it are ready for production.

The GenAI Stack

With interest in GenAI growing and new innovations emerging every day, Neo4j announced the GenAI Stack several days ago: a pre-built development environment for creating GenAI applications, built in partnership with Docker, LangChain, and Ollama.

Source: https://neo4j.com/emil/introducing-genai-stack-developers/

Long regarded as a graph database, Neo4j has now further strengthened its position in the GenAI trend, following its earlier announcement of vector index support in its native graph database platform.

Given that there are already dozens of vector storage and indexing technologies on the market, and that more and more traditional DBMSs have added vector support, this article goes through the key architecture considerations for evaluating a suitable vector store.

Embeddings Are Enterprise Data

It wouldn’t be surprising to see vectors, e.g. those generated by text embedding models, become part of enterprise data with the rise of GenAI-powered solutions, so it is necessary to review what they actually are and which enterprise capabilities are relevant.

A vector is simply an array of floating-point numbers. For a more comprehensive explanation of what a vector / embedding is, below is a post for your reference:
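To make the "array of floats" idea concrete, here is a minimal Python sketch of how two vectors can be compared with cosine similarity, a measure commonly used in semantic search. The three 4-dimensional vectors and their names are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not produced by any real model.
cat = [0.9, 0.1, 0.0, 0.3]
kitten = [0.8, 0.2, 0.1, 0.3]
invoice = [0.0, 0.9, 0.8, 0.1]

# The two related items score closer to 1.0 than the unrelated pair.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

A vector index answers the same kind of question, "which stored vectors are closest to this one?", without comparing the query against every stored vector one by one.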

According to LlamaIndex, a conceptual architecture for a Retrieval Augmented Generation (RAG) solution, which takes advantage of embeddings for semantic search, looks like this:

Source: LlamaIndex

Here, various storage technologies are used for documents, key-value pairs and vectors, as well as indices. Clearly, this architecture requires extra deployment, integration and operational support.

Neo4j’s GenAI Stack puts all data in its graph database, and leverages its native indexing for literal values, free-form text (TF-IDF style) and vectors.

Neo4j’s Native Graph Store is capable of processing many data types.

Clearly, this is a much simpler architecture to run in a real production environment. In fact, both structured and unstructured data (e.g. a document) can be easily stored and queried in a graph database.

Below, let’s go through the key architecture considerations for using Neo4j Graph DB for vector storage and search.

Choosing Vector DB: The Architecture Considerations

1. Storage Schema

Neo4j is often described as a “schema-lite” (or “schema-optional”) database, which contrasts with the rigid schema structures often found in relational databases. Let me explain.

1 ) Flexibility in Data Ingestion: In Neo4j, you can start ingesting data without predefining a schema. This allows developers to immediately start adding nodes and relationships without the need to design and enforce a strict schema ahead of time.

2 ) Adaptable to Changes: As applications evolve, data requirements can change. Being schema-lite ensures that the database can easily adapt to these changes without requiring extensive modifications or downtime.

3 ) Implicit Schema: While Neo4j doesn’t enforce a strict schema, it inherently possesses a form of schema through its data. Nodes have labels, and relationships have types. Properties can be added to nodes and relationships, allowing for a form of structure and categorization. This provides a balance between structure and flexibility.

4 ) Schema Constraints: While Neo4j offers the flexibility of a schema-lite approach, it also provides tools to enforce data integrity when needed. Users can define unique constraints and existence constraints on properties, ensuring data consistency when required.

5 ) Ideal for Varied and Evolving Data: In situations where data is diverse and can evolve over time (such as with social networks or interconnected datasets), a schema-lite approach allows for the natural growth and evolution of data without the constraints of a fixed schema.

Graphs are inherently more flexible and adaptable structures compared to tabular data, and they can more naturally represent complex, interconnected data without a predefined schema.

2. Scalability

The database should be able to handle growth in data volume and user load. When throughput demand for the same dataset rises, Neo4j enables easy scale-out with Autonomous Clustering.

This architecture automatically allocates database copies to the optimal servers based on default business rules or specified operational requirements, which makes horizontal scaling simple. Through the Composite Database feature, it is also possible to query vectors across multiple databases without having to load them into one place first.

Neo4j Autonomous Clustering. Source: https://neo4j.com/product/neo4j-graph-database/scalability/

3. Performance

The current implementation of the Vector Index is based on Lucene 9’s HNSW-style index, an approach that uses approximate nearest neighbor (ANN) search. In my own test on PC workstation hardware without a GPU, searching for the top 50 most similar vectors among 1 million vectors in the database, each with 960 dimensions, took 20~50ms per request per CPU core.
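As a rough back-of-the-envelope reading of those numbers, the sketch below converts the measured latency into throughput per core and estimates the raw size of the vectors. It assumes requests are served one after another on a single core and that each dimension is stored as a 4-byte float; both are my assumptions for illustration, and index overhead is not included.

```python
def queries_per_second(latency_ms):
    """Sequential throughput of one core for a given per-request latency."""
    return 1000.0 / latency_ms


def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw payload size of the vectors, excluding any index overhead.

    bytes_per_value=4 assumes 32-bit floats, which is an assumption here.
    """
    return num_vectors * dimensions * bytes_per_value


best_case = queries_per_second(20)    # 50.0 queries/s per core
worst_case = queries_per_second(50)   # 20.0 queries/s per core
size_gb = raw_vector_bytes(1_000_000, 960) / 1e9  # 3.84 (decimal GB)

print(f"{worst_case:.0f} to {best_case:.0f} queries/s per core")
print(f"about {size_gb:.2f} GB of raw vector data")
```

Numbers like these are only a starting point for capacity planning; actual throughput depends on hardware, index configuration and concurrent load.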

4. Availability and Reliability

High availability ensures the database remains operational even during failures. Neo4j’s clustering provides these main features:

1 ) Safety: Servers hosting databases in primary mode provide a fault tolerant platform for transaction processing which remains available while a simple majority of those Primary Servers are functioning.

2 ) Scale: Servers hosting databases in secondary mode provide a massively scalable platform for graph queries that enables very large graph workloads to be executed in a widely distributed topology.

3 ) Causal consistency: When invoked, a client application is guaranteed to read at least its own writes.

4 ) Operability: Database management is separated from server management.

Neo4j Database Cluster. Source: https://neo4j.com/docs/operations-manual/current/clustering/introduction/
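The "simple majority" rule in the Safety point above implies a simple piece of arithmetic for how many primary failures a database can survive. The helper names below are mine and are not part of any Neo4j API; this is just a sketch of the majority calculation.

```python
def majority(primaries):
    """Smallest number of primaries that forms a simple majority."""
    return primaries // 2 + 1


def tolerated_failures(primaries):
    """How many primaries can fail while a majority is still functioning."""
    return primaries - majority(primaries)


for n in (3, 4, 5, 7):
    print(f"{n} primaries: majority={majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Note that going from 3 to 4 primaries does not increase the number of tolerated failures, which is why odd numbers of primaries are the usual choice under a majority rule.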

5. Data Consistency

For data scientists, ACID may sound unfamiliar, but it is a common concept among database developers for addressing data consistency. ACID stands for Atomicity, Consistency, Isolation and Durability.

As a simple analogy, let’s use money transfer to explain ACID:

  • Atomicity ensures that the deduction from your account and the credit to the destination account happen together. If any exception occurs, neither should happen.
  • Consistency ensures both accounts reflect the correct amounts after your transaction.
  • Isolation guarantees that the changed balances become visible only after the transfer is done.
  • Durability ensures that once your money has been transferred, the change is permanent.

Neo4j is a fully ACID-compliant database. As a result, saving and updating vectors stored as Node and/or Relationship property values always complies with the ACID principles above.
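The money-transfer analogy can be sketched in a few lines of runnable Python. To keep the example self-contained it uses SQLite from the Python standard library as a stand-in for any ACID-compliant store; it is not Neo4j code, and the table and account names are hypothetical.

```python
import sqlite3


def transfer(conn, source, destination, amount):
    """Move `amount` between accounts; both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, destination),
            )
            row = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (source,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("unknown account or insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return all balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM account"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

print(transfer(conn, "alice", "bob", 30), balances(conn))   # succeeds
print(transfer(conn, "alice", "bob", 500), balances(conn))  # rolled back
```

The second transfer fails after both updates have already been issued, yet neither balance changes, which is the atomicity guarantee. The same expectation applies to a vector written in the same transaction as the node it belongs to.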

6. Security

Consider encryption (both at rest and in transit), access controls, audit capabilities, and other security features. Ensure the database complies with regulatory and organizational security standards.

In my previous post, I explained how to apply Role-Based Access Control to the results of a vector index search. Below is the link:

At first glance, a vector is just a collection of decimal numbers that can’t be understood by human beings anyway, so it may seem OK to store it without protection. However, according to this paper, attacks on popular sentence embeddings recover between 50%–70% of the input words (F1 scores of 0.5–0.7). In addition, embeddings may reveal sensitive attributes inherent in the inputs and independent of the underlying semantic task at hand. Embeddings are crackable! This puts security and confidentiality back at the top of the list of criteria when considering a vector store for potentially sensitive data.

7. Query Language Features

A vector store should at least provide an API for regular CRUD operations. A feature-rich, declarative query language greatly helps reduce the learning curve and the development cycle.

Neo4j’s Cypher is a declarative, graph query language specifically designed for querying and manipulating graph data in the Neo4j database. If you are not familiar with Cypher, there are simple samples to start with from the post below:

The Neo4j DBMS itself exposes RESTful APIs. There is also the GRANDstack, a full-stack framework for building applications on a graph database.

GraphQL provides a more flexible way to query the Neo4j graph database. Source: https://neo4j.com/blog/grandstack-graphs-way-down-nodes/
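To illustrate what a declarative query over both vectors and graph structure might look like, here is a hedged sketch that keeps a Cypher statement as a Python constant together with its parameters. The procedure db.index.vector.queryNodes is the vector search procedure as I know it from Neo4j 5.11 onwards, so please check the documentation for your version. The Document label, HAS_CHUNK relationship, property names and index name are hypothetical.

```python
# Cypher kept as plain text so it can be handed to any Neo4j client.
# Labels, relationship type and properties below are hypothetical examples.
SIMILAR_CHUNKS_WITH_CONTEXT = """
CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
YIELD node AS chunk, score
MATCH (doc:Document)-[:HAS_CHUNK]->(chunk)
RETURN doc.title AS title, chunk.text AS text, score
ORDER BY score DESC
"""


def build_query(index_name, embedding, top_k=5):
    """Return the Cypher text and a parameter map for a vector search."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    params = {
        "index_name": index_name,
        "top_k": top_k,
        "embedding": [float(x) for x in embedding],
    }
    return SIMILAR_CHUNKS_WITH_CONTEXT, params


query, params = build_query("chunk_embeddings", [0.1, 0.2, 0.3], top_k=3)
print(params["top_k"], len(params["embedding"]))
```

With the official Python driver, the pair could be executed with something like session.run(query, params). The point of the sketch is that the similarity search and the traversal from a matching chunk to its parent document are expressed in a single statement.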

8. Ecosystem and Integration

Ensure the database integrates well with your existing tech stack and tools. Consider backup solutions, monitoring tools, and third-party integrations. Relevant evaluation criteria for a vector DB include how it manages updates to vectors, maintains version history and integrates with existing data pipeline tools, e.g. ETL and messaging middleware.

9. Cost Efficiency

Factor in both the immediate costs (licenses, hardware) and ongoing costs (maintenance, scaling, support). As the vector index is a native feature of the Neo4j Graph Database Enterprise and Aura editions, there is no extra cost from a licensing perspective.

10. Operational Complexity

Evaluate the skill set required to operate the database, its maintenance overhead, and the ease of finding expertise in the market.

Backup and Recovery: The database should have robust backup mechanisms, and it should be straightforward to recover data in case of failures.

Migration and Portability: Consider how easy it would be to migrate data in and out of the database or switch to a different solution in the future.

11. Other Aspects to Consider

Community and Documentation: For open-source databases, a strong community can be invaluable. Adequate documentation helps in smoother adoption and troubleshooting.

Regulatory Compliance: For regulated industries, ensure that the database can help you meet the necessary regulatory compliance standards. For cloud-based data stores, SOC 2 Type II is a must-have.

Conclusions

The powerful combination of graphs and LLMs is going to make both technologies more accessible and valuable to enterprises and governments. Storing vectors in the Neo4j graph database and leveraging native vector search as a core capability has proven to be a promising solution, as it combines the implicit relationships uncovered by vectors with the explicit and factual relationships and patterns illuminated by graphs.

Fanghua (Joshua) Yu

I believe our lives become more meaningful when we are connected, so is data. Happy to connect and share: https://www.linkedin.com/in/joshuayu/