Article recommendation with Personalized PageRank and Full Text Search

6 months ago Tomaz Bratanic wrote a great blog post showing how to build an article recommendation engine using NLP techniques and the Personalized PageRank algorithm from the Graph Algorithms library.

In the post Tomaz extracts key words for each article using the GraphAware NLP library, and then runs PageRank in the context of articles based on these key words.

I was curious whether I could create a poor man’s version of Tomaz’s work using the Full Text Search functionality that was added in Neo4j 3.5, and so here we are!

Tomaz explains how to import the data in his post, so we’ll continue from there. The diagram below shows the graph model that we’ll be working with. We have articles written by authors, and those articles can reference each other.

Graph Model

The first thing we need to do is create a Full Text Search index for our Article nodes. We’ll index the title and abstract properties on these nodes.

CALL db.index.fulltext.createNodeIndex('articlesAll', 
['Article'], ['title', 'abstract'])

We can check on the progress of the index creation by running the following query:

CALL db.indexes()

The index will have a state of POPULATING while node properties are being added to it. This state will change to ONLINE once population has finished. The following query will block until the index is online:

CALL db.index.fulltext.awaitIndex("articlesAll")

Now that we’ve done this, let’s get on with the algorithms.

Social Network Analysis Papers

Tomaz first explores articles that contain the phrase “social networks”. Let’s create a parameter containing that search term:

:param searchTerm => '"social networks"'

Note that we’ve put the search term in quotes. We do this so that Full Text Search will treat the term as a phrase rather than interpreting each word separately.
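To make the difference concrete, here is a toy Python sketch of the two matching modes. It is not how Lucene (which backs Neo4j’s full text indexes) is implemented: the crude tokenizer, the sample titles, and the helper names `matches_any_term` and `matches_phrase` are all assumptions made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    This is a crude stand-in for a real analyzer (assumption for illustration).
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_any_term(text, query):
    """Unquoted query: match if the text contains at least one query term."""
    tokens = set(tokenize(text))
    return any(term in tokens for term in tokenize(query))


def matches_phrase(text, phrase):
    """Quoted query: the terms must appear next to each other, in order."""
    tokens = tokenize(text)
    wanted = tokenize(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


# Hypothetical article titles, not taken from the dataset.
titles = [
    "Community detection in social networks",
    "Neural networks for image classification",
    "Social media and sensor networks",
]

any_term_hits = [t for t in titles if matches_any_term(t, "social networks")]
phrase_hits = [t for t in titles if matches_phrase(t, "social networks")]
```

In this sketch the unquoted search matches all three titles, because each one contains the word “networks”, while the phrase search only matches the first.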

Now we want to call the PageRank algorithm from the point of view of articles that contain this search term. Let’s first see how many articles the full text index comes back with:

CALL db.index.fulltext.queryNodes("articlesAll", $searchTerm)
YIELD node, score
RETURN count(*)

Just under 15,000 nodes, or around 0.5% of all articles, are returned by the query. The following query will return the top 10 articles for the search term:

CALL db.index.fulltext.queryNodes("articlesAll", $searchTerm)
YIELD node, score
RETURN node.id, node.title, score
LIMIT 10

Now we can feed these nodes into the PageRank algorithm as the sourceNodes config parameter. This will bias the results of the algorithm around these nodes.

The following query will find us the most influential articles about social networks:

CALL db.index.fulltext.queryNodes("articlesAll", $searchTerm)
YIELD node
WITH collect(node) as articles
CALL algo.pageRank.stream('Article', 'REFERENCES', {
sourceNodes: articles
})
YIELD nodeId, score
WITH nodeId,score
ORDER BY score DESC
LIMIT 10
RETURN algo.getNodeById(nodeId).title as article, score

As in Tomaz’s post, Sergey Brin and Larry Page’s paper describing Google shows up in first place.
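If it isn’t obvious why passing source nodes biases the ranking, the following Python sketch shows the idea on a tiny made-up citation graph. It is a plain power iteration written for illustration, not the Graph Algorithms library’s implementation, so its scores won’t match the library’s. The function name `personalized_pagerank`, the node names, and the choice to send the rank of dangling nodes back to the source nodes are all assumptions.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=50):
    """Power iteration in which the random jump only lands on source_nodes."""
    sources = set(source_nodes)
    nodes = sorted({n for edge in edges for n in edge} | sources)
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Teleport distribution: uniform over the source nodes, zero elsewhere.
    teleport = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                # Split this node's rank evenly between the articles it cites.
                share = damping * rank[n] / len(out_links[n])
                for dst in out_links[n]:
                    new_rank[dst] += share
            else:
                # Dangling node (assumption): hand its rank back to the sources.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
        rank = new_rank
    return rank


# Hypothetical graph: two "social" papers cite hub1, three "database" papers cite hub2.
edges = [
    ("s1", "hub1"), ("s2", "hub1"),
    ("d1", "hub2"), ("d2", "hub2"), ("d3", "hub2"),
]

social_rank = personalized_pagerank(edges, {"s1", "s2"})
database_rank = personalized_pagerank(edges, {"d1", "d2", "d3"})
```

With the two social papers as source nodes, hub1 comes out on top and hub2 scores zero, even though hub2 has more citations overall. Swapping the source nodes flips the ranking, which is the same effect we rely on when we seed PageRank with the full text search results.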

Entropy to me is not entropy to you

In the next part of the post, Tomaz shows how we can write queries to find papers that would be interesting to researchers in different fields.

Recommendation of articles described by keyword “entropy” from the point of view of Jose C. Principe.

Let’s set up the parameters:

:param authorName => "Jose C. Principe";
:param searchTerm => "entropy"

And now run the query:

MATCH (a:Article)-[:AUTHOR]->(author:Author)
WHERE author.name = $authorName
WITH author, collect(a) AS articles
// Cypher projection: nodes come from the full text index, relationships are the citations
CALL algo.pageRank.stream(
  'CALL db.index.fulltext.queryNodes("articlesAll", $searchTerm)
   YIELD node
   RETURN id(node) AS id',
  'MATCH (a1:Article)-[:REFERENCES]->(a2:Article)
   RETURN id(a1) AS source, id(a2) AS target',
  { sourceNodes: articles,
    graph: 'cypher',
    params: {searchTerm: $searchTerm}})
YIELD nodeId, score
// Keep author in scope so that the filter below refers to the matched author
WITH author, algo.getNodeById(nodeId) AS n, score
// Exclude articles that the author wrote themselves
WHERE NOT (n)-[:AUTHOR]->(author)
RETURN n.title AS article, score,
       [(n)-[:AUTHOR]->(coauthor) | coauthor.name][..5] AS authors
ORDER BY score DESC
LIMIT 10

We’ll see these results:

And what if we run the same query for a different author?

:param authorName => "Hong Wang";

We’ll see these results:

We don’t get exactly the same results as Tomaz, but we do still get a different set of results for the different authors.

Summary

It does seem that we can get a reasonable approximation of the results in Tomaz’s post using Neo4j’s Full Text Search functionality.

If you have any other ideas of what we can do with this dataset, let me know by emailing devrel@neo4j.com