How I created a Neo4j Search Engine with Generative AI

Thcookieh
8 min read · Mar 13, 2024


Over the last couple of weeks I have been working on a POC that demonstrates the use of Neo4j with Generative AI. The initial idea was to build a search engine for code, so that required documentation could be found and used in cloud projects.

Why should I use Graphs with Generative AI?

Graphs represent data in which the relationships between concepts carry information of their own. Like this: ( person )-[ works_at ]->( place )

A Person and a Place are entities related by a concept, such as works at. Representing data this way in Neo4j is often far more natural than chaining joins in SQL. On top of that, graphs open the door to graph theory: connecting structured data creates richer information, and algorithms such as “shortest path” or “PageRank” can be run directly over the data.
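The pattern above can be sketched as a parameterised Cypher statement, held here in a Python string the way it would be passed to the driver. The labels, relationship type, and the `since` property are illustrative; the point is that the relationship itself can carry data.

```python
# Illustrative Cypher for the (person)-[works_at]->(place) pattern.
# MERGE creates the nodes and the relationship only if they do not
# already exist; the relationship holds its own property (`since`).
WORKS_AT_PATTERN = """
MERGE (p:Person {name: $name})
MERGE (c:Place  {name: $place})
MERGE (p)-[r:WORKS_AT {since: $since}]->(c)
"""
```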

PageRank is only one of many ways graphs can be used; Google famously used it to rank pages in its search engine.

As for Generative AI: LLMs are very good at language tasks such as translation, explanation, comparison, combination of ideas, and abstraction. In addition, embeddings, which create a numeric representation of data, make all of these easy to compute over.

I have already written about why you should always combine LLMs with embeddings and graphs; you can take a look below.

Plan of action

The POC followed this plan:

  1. Find a dataset that can be used.
  2. Ingest the dataset in a Neo4j format (Semantic if possible).
  3. Select a node to do the semantic search.
  4. Create an embedding representation of the node.
  5. Add the embedded representation to the node.
  6. Create a Vector index on the embedded value.
  7. Embed a query and compare with the index.
  8. Translate the result.

The process was the following:

1. Find a dataset that can be used.

A free Neo4j instance was spun up on their official website; the free tier is enough for this POC.

Inside it there are guides for learning the platform. The Social Network Analysis sample, a database of Stack Overflow Q&A plus users and tags, turned out to be perfect for this experiment.

2. Ingest the dataset in a Neo4j format (Semantic if possible).

I loaded it into my instance, and here is the resulting schema.

The data is already well structured, so no additional transformation was needed to make it work.

3. Select a node to do the semantic search.

For this purpose I chose the “Answer” node as the target of the semantic search. The data contains answers to Neo4j-related questions, and the ones flagged “is accepted” are the ones that actually resolved the issue.

4. Create an embedding representation of the node.

For this part I had to use my own resources to access an embedding model. There are free ones on Hugging Face, but for practicality I used OpenAI and a Google Colab notebook.

This was the toughest part to get working, because the Neo4j free tier would not let me ingest data from CSVs, nor write the 1024-dimensional vectors from a query inside a Python script.

I had to do some data processing to get the answer embeddings into Neo4j; the best solution I found was to build a data frame and export it as a CSV.
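A minimal sketch of that staging step: embed each answer, then serialise the vectors into a CSV that the Neo4j import tool can read. The model name, dimension setting, and column names are my own choices from this setup, not anything the dataset prescribes.

```python
import csv
import io
import json

def embeddings_to_csv(rows):
    """rows: iterable of (answer_id, embedding) pairs.
    Serialises each embedding as a JSON string so it survives the CSV
    round-trip; it gets parsed back into a float list inside Neo4j later."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["answer_id", "embedding"])
    for answer_id, vector in rows:
        writer.writerow([answer_id, json.dumps(vector)])
    return buf.getvalue()

# Fetching the vectors themselves (requires an OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# resp = client.embeddings.create(model="text-embedding-3-small",
#                                 input=answer_texts, dimensions=1024)
# vectors = [d.embedding for d in resp.data]
```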

5. Add the embedded representation to the node.

With the embeddings converted to a CSV file, I ingested them through the Neo4j import tool.

This gave me a new node type in the database holding the embeddings. I only had to connect each one to its Answer via id and then set the embedding value on the Answer node.

A side effect is that the array arrived as a string on the Answer node, so some more preprocessing was needed there.

The results were good at this point: the embedding was now a list of 1024 floats.
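The string-to-floats clean-up can be sketched like this. The Cypher variant in the comment uses APOC, assuming the plugin is available on the instance.

```python
import json

def parse_embedding(raw):
    """Turn the stringified array from the CSV import, e.g. "[0.12, -0.3]",
    back into a list of floats suitable for a vector index."""
    return [float(x) for x in json.loads(raw)]

# The same conversion can be done in Cypher with APOC, e.g.:
# MATCH (a:Answer) WHERE a.embedding IS NOT NULL
# SET a.embedding = apoc.convert.fromJsonList(a.embedding)
```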

6. Create a Vector index on the embedded value.

The Neo4j documentation explains how to create a vector index on a node property, and I followed that approach.

Cosine similarity lets you compare a new embedding against the vector index and find the most similar entries, which is exactly what the search engine needs.
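A sketch of the index creation, using the Neo4j 5.x `CREATE VECTOR INDEX` syntax. The index, label, and property names match the setup described above; adjust them to your own schema.

```python
def vector_index_cypher(name, label, prop, dims=1024, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement (Neo4j 5.x syntax)."""
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON (n.{prop})\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dims},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )

# Running it against a live instance with the official Python driver:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver(uri, auth=(user, password))
# with driver.session() as session:
#     session.run(vector_index_cypher("answer_embeddings", "Answer", "embedding"))
```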

7. Embed a query and compare with the index.

This part was easy: create a query, embed it with the same model, and run a test.

The query given is “How to optimize code in Neo4j”.

This is the result:

Sweet, this works perfectly. What remains is wrapping it in Python code that can run on a server or inside a product.
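That Python wrapper could look roughly like this, using `db.index.vector.queryNodes` against the index created above. The `body` property name is an assumption about the Stack Overflow sample schema, and the session comes from a live driver connection.

```python
# Cypher that asks the vector index for the k nearest Answer nodes.
SEARCH_CYPHER = """
CALL db.index.vector.queryNodes('answer_embeddings', $k, $query_embedding)
YIELD node, score
RETURN node.body AS answer, score
ORDER BY score DESC
"""

def search(session, query_embedding, k=5):
    """Run the similarity search and return (answer, score) pairs."""
    result = session.run(SEARCH_CYPHER, k=k, query_embedding=query_embedding)
    return [(record["answer"], record["score"]) for record in result]

# Usage (needs a live session and an embedded query):
# vec = embed("How to optimize code in Neo4j")
# with driver.session() as session:
#     print(search(session, vec))
```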

8. Translate the result.

The last step is to feed the retrieved output to an LLM so it can explain the answer using that context.
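A hedged sketch of that final translation step: stuff the retrieved answers into a prompt and ask the model to explain only from that context. The prompt wording and model name are my own choices, not the exact ones used in the POC.

```python
def build_prompt(question, retrieved_answers):
    """Assemble a context-grounded prompt from the vector-search results."""
    context = "\n\n".join(f"- {a}" for a in retrieved_answers)
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Sending it to the LLM (requires an OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user",
#                "content": build_prompt(query, top_answers)}],
# )
# print(reply.choices[0].message.content)
```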

Query: “What is a good way to use constraints in Neo4j?”

Response from the Neo4j vector index
Explanation from the LLM

What is a good way to use constraints in Neo4j?

Constraints are a powerful tool in Neo4j that allow you to ensure data integrity and improve query performance. There are a few different ways to use constraints in Neo4j, depending on your specific needs.

1. Unique Constraints on Two Properties
If you want to create unique constraints on two properties in Neo4j, you can use the CREATE CONSTRAINT syntax. This will ensure that no two nodes have the same combination of values for the specified properties.

For example, if you want to ensure that no two nodes have the same “source” and “target” values, you can use the following syntax: CREATE CONSTRAINT uSourceTarget IF NOT EXISTS ON (m:route) ASSERT (m.source, m.target) IS NODE KEY This will create a unique constraint named “uSourceTarget” on the “source” and “target” properties for all nodes with the “route” label.

2. Adding Constraints to Nodes Based on Property Values
Instead of adding a flag property to nodes, you can use labels to indicate which nodes should have constraints.

For example, if you have an “Entity” label and want to add a constraint to nodes with a “foo” property, you can use the following syntax: CREATE CONSTRAINT ON (ce:ConstrainedEntity) ASSERT

Next steps

We can create embeddings, and code to query them, for each individual node type:

  • Question — To find other questions similar to mine.
  • Answer — To find a proven solution to my problem.
  • User — To find who is contributing more.
  • Tag — To find relevant categories for my topic.

This will let us expose each search as a tool that a ReAct agent can make decisions over.
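The "one tool per node type" idea can be sketched as a small toolbox plus a dispatcher for the agent's action step. The index names are illustrative, and `search_fn` stands in for the vector search shown earlier.

```python
def make_toolbox(search_fn):
    """Map each node type to a search over its own embedding index.
    `search_fn(index_name, query)` is a placeholder for the vector search;
    the index names here are illustrative."""
    return {
        "question": lambda q: search_fn("question_embeddings", q),
        "answer":   lambda q: search_fn("answer_embeddings", q),
        "user":     lambda q: search_fn("user_embeddings", q),
        "tag":      lambda q: search_fn("tag_embeddings", q),
    }

def dispatch(toolbox, tool_name, query):
    """The agent's action step: route the query to the chosen tool."""
    if tool_name not in toolbox:
        raise ValueError(f"unknown tool: {tool_name}")
    return toolbox[tool_name](query)
```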

Agent Architecture on LLMs

Say we ask the agent about a specific user and the answers that user has given to others.

The agent will decide to use the user tool to find the required user, and from there you can do anything you want with that data: connect to other nodes, expand the query, create reports, run data analysis, and so on. The sky is the limit.

Conclusion

Neo4j makes it easier to connect and retrieve information than many other databases, and vector indexing makes building an embedding-based search engine a breeze; by comparing vectors, the same setup can double as a recommendation engine.

In a previous blog I talked about how working with LLMs can get you into security problems; two of the most frequent are hallucinations and over-reliance. Here is a link if you want to know more about these risks.

The solution proposed here mitigates those problems by limiting the model to translating from context. LLMs are good at language tasks, not fact memorization. This is why I love using RAG (Retrieval-Augmented Generation) in my projects, and this project is an example of it.

A tool like this leads to better, more efficient, faster delivery of solutions and a more natural experience for anyone working on related projects.

If you want to see the code, you can take a look here.

Thank you so much for reading my post. If you made it this far, please consider subscribing to my newsletter, sharing, commenting, or leaving a clap. It helps us a lot, and it’s a constant motivation to keep creating content like this.

We have a lot on our hands at the moment, but we love sharing content, and your interaction is a good reminder that taking a moment to write helps others and is time well spent. Don’t forget to check out our social media and our agency if you want us to help you build your business around AI.


Thcookieh

R&D | AI Consultant | You cannot compete with someone who loves what he does. It is in his instinct. He does not compete. He lives.