# Analyse Neo4j Graph of Books domain model with Jaccard Similarity Algorithm

## May 27: Library Day, One Month Graph Challenge

# Welcome word

In this series of small posts I do one simple graph daily. Domain model of graph somehow related to day’s history, some historical event, celebration or person. I do this challenge to learn Neo4j Data Modeling and Cypher. Every day. One month. Follow me. Maybe you will be inspired and next month would be yours One Month Graph Challenge. #OMGChallenge

# Domain model

Today there are not much world wide events happened. And after additional research I decide to stop on best, from my point of view, celebration:

All-Russian Library Day, celebrated annually on May 27.

I think, it is a good possibility to talk about Russian Literature and many other great creations, that did huge impact on whole world culture.

My small graph would be based on books-related domain model. I want to try to find any interesting numbers and then try to expand relationships based on this mined data. Maybe also will try to find some useful recommendations for myself, based on my reading interests. Let’s see how far it will go.

# Graph

First of all, let’s load HTML and parse data with help of APOC function from quite cute book online service I found. As you can see below, I decide to collect books from top pf Russian literature, “Best by century” and my lovely Sci-Fi genre.

`UNWIND [`

"113",//best russian literature

"16",//best of 19th

"6",//best of 20th

"7",//best of 21st

"82083",//best millennium

"114319"//best sci-fi of 20th

] as url

CALL apoc.load.html("https://www.goodreads.com/list/show/" + url, { main: "table.tableList.js-dataTooltip tbody tr td a.bookTitle", author: "table.tableList.js-dataTooltip tbody tr td a.authorName", score: "table.tableList.js-dataTooltip tbody tr td span.greyText.smallText.uitext" }) YIELD value

WITH collect(value) as data, url

WITH { main: data[2].main, author: data[1].author, score: data[0].score } as root, url

WITH [ x IN range(0, length(root.main) - 1) | { title: root.main[x].text, url: root.main[x].attributes.href, author: root.author[x].text, scoreBase: root.score[x].text }] as result

UNWIND result as item

WITH item.title as title, item.url as url, item.author as author, apoc.text.regexGroups(item.scoreBase, "(\\d+.\\d+)(.+ )(\\d+,*\\d+)")[0] as groups

WITH title, url, author, toFloat(groups[1]) as score, toInteger(apoc.text.replace(groups[3], ",", "")) as readersNum

MERGE (b:Book { title: title })

ON CREATE SET b.url = url, b.score = score, b.readersNum = readersNum

ON MATCH SET b.url = url, b.score = score, b.readersNum = readersNum

MERGE (a:Author { name: author })

MERGE (b)-[:WRITTEN_BY]->(a)

Interesting case to mention, if you go with something like this you will got into a troubles:

`MERGE (b:Book { title: title , url: url, score: score, readersNum: readersNum })`

And here explanation why: same book can appear in several categories. And between html parsing of this book, people on web-site high can actively adding more scores and “reads”. This is why in sereval ms same book can already have new value of score and readersNum. And lead to new book node as we set all 4 property at once.

`MERGE (b:Book { title: title })`

ON CREATE SET b.url = url, b.score = score, b.readersNum = readersNum

ON MATCH SET b.url = url, b.score = score, b.readersNum = readersNum

This approach gives me correct result. I match in MERGE by title only and then update book properties with new values from website, whatever are they changed or not.

I have 521 books by 315 authors. Still not enough for me. I want to know genres for every book. This is why I include **url** property in previous query. Let’s add top 5 genres specified for each book.

Next query is high time consuming. Be sure you have cup of tea nearby.

`MATCH (b:Book)`

CALL apoc.load.html("https://www.goodreads.com" + b.url, { data: "div.left a.actionLinkLite.bookPageGenreLink" }) YIELD value

UNWIND value.data[..5] as item

MERGE (g:Genre { name: item.text })

MERGE (b)-[:OF]->(g)

Let’s see on authors with biggest number of books in classics genre.

`MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)`

WHERE g.name = "Classics"

RETURN a.name as author, count(b) as books

ORDER BY books DESC

LIMIT 10

I want to know the who have biggest amount of readers. But because we have here a big difference of books per each author, I think, it is good to compare them by some sum of readers only for N books. This N is an average number of books for author. This approach must be much more fair.

So, once again: I will find an average number of books (N) in graph, cut authors with less books, plus will take only N books from those, who have more books in this data. Then I will find sum of readers of these books and compare.

`MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)`

WHERE g.name = "Classics"

WITH g, a, count(b) as booksNum

WITH avg(booksNum) as booksAvg

WITH toInteger(ceil(booksAvg)) as avg

MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)

WHERE g.name = "Classics"

WITH a, count(b) as booksNum, collect({ readersNum: b.readersNum }) as readers, avg

WHERE booksNum >= avg

WITH a.name as author, readers, avg

WITH author, apoc.coll.sortMaps(readers, 'readersNum')[..avg] as topReaded

WITH author, reduce(r = 0, x IN topReaded | r + x.readersNum) as readersSum

RETURN author, readersSum

ORDER BY readersSum DESC

LIMIT 15

This is great to see, that Leo Tolstoy and Fyodor Dostoyevsky have really big readers auditory even at US located resource. Classics, what to say.

Now I want to come back to genres and analyse them a little bit more.

For example, top 10 author working in many genres:

`MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)`

WITH a.name as author, g.name as genre

WITH author, collect(DISTINCT genre) as genres

WITH author, size(genres) as genresNum

RETURN author, genresNum

ORDER BY genresNum DESC

LIMIT 10

Much more interesting to find authors by similarity of genres they both work with. To solve this task we can use Jaccard Similarity algorithm.

`MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)`

WITH { item: id(a), categories: collect(id(g)) } as authorData

WITH collect(authorData) as data

CALL algo.similarity.jaccard.stream(data) YIELD item1, item2, count1, count2, intersection, similarity

WITH algo.asNode(item1).name AS author1, algo.asNode(item2).name AS author2, intersection, similarity

WHERE similarity > 0.6

RETURN author1, author2, intersection, similarity

ORDER BY intersection DESC, similarity DESC

LIMIT 10

You can play with ORDER BY statements, put similarity into the first place to find pairs of authors similar by genres. Or, for example, count of pairs of totally “genre similar” authors.

`...`

WHERE similarity = 1.0

RETURN count(similarity) as c

Now, based on this information we can link authors with similarity relationship.

`MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)`

WITH { item: id(a), categories: collect(id(g)) } as authorData

WITH collect(authorData) as data

CALL algo.similarity.jaccard.stream(data) YIELD item1, item2, count1, count2, intersection, similarity

WITH algo.asNode(item1) AS author1, algo.asNode(item2) AS author2, intersection, similarity

WHERE similarity > 0.5

MERGE (author1)-[:SIMILAR_TO { value: similarity }]->(author2)

I want to find some similar authors in Sci-Fi genre, very close to Ray Bredbery and Philip K. Dick and good review score from readers.

`MATCH (a:Author)-[s:SIMILAR_TO]->(aa:Author)<-[:WRITTEN_BY]-(b:Book)`

WHERE (a.name = "Ray Bradbury" OR a.name = "Philip K. Dick") AND b.score > 4.0 AND s.value > 0.5

WITH aa.name as author, s.value as similarity, collect(b) as books

WITH author, similarity, apoc.coll.sortMaps(books, 'score') as books

RETURN author, similarity, [x IN books | x.title] as books

ORDER BY similarity DESC

Nice. Looks like I found some interesting books to read in the next month. If you have any idea about to use existing domain — share in comments!

# Resume

Today I did pretty simple, but useful graph analysis. Books are wonderful. Graph of books wonderful twice.

Using the fact, that this is my blog, I want to give personal advice for those who love Sci-Fi, if you not mind: Ubik by Philip K. Dick and The Invincible by Stanislav Lem. Have a good reading!

# Similar topics

Neo4j PageRank Algorithm and Path Pattern Matching on Flights Domain Model

Unusual use of Neo4j Common Neighbors Algorithm on people community graph

Neo4j Betweenness Centrality Algorithm detailed and fun example

Neo4j kShortestPaths Algorithm apply to City Map

Neo4j Cosine Similarity Algorithm example