Learn Japanese characters using Neo4j

Building a kanji quiz app with GraphQL, React, and a graph database in 2 days

Jimmy Crequer

Published in

Neo4j Developer Blog

6 min read · Sep 20, 2019

TL;DR

In just two days, we built a fully working kanji quiz app with the GRANDstack by going through the following steps:

Import some CSV datasets to create our graph database.
Run the Jaccard algorithm to create more relationships between our nodes.
Run the PageRank algorithm to compute an additional property on some nodes.
Create APIs using GraphQL to generate random quiz questions.

You can find the source code in the GitHub repository.

Last week, I joined a 2-day hackathon held at my company. The theme was “New, Fun, Speed”, and our team set out to build a small quiz app for learning Japanese characters in a new and faster way, while having fun. My company is located in Japan, and we thought our fellow foreign colleagues could benefit from this idea.

We decided to use the GRANDstack to build our app:

GraphQL for the API endpoints
React for the frontend
Apollo for facilitating communication between API and frontend
Neo4j as our database technology to store the Japanese characters and build connections between them

Let’s get into it!

Build the graph

Import the datasets

We found an open Japanese-Language Proficiency Test (JLPT) dataset in a handy CSV format. Each row contains a kanji, its readings separated by “・”, and its meanings separated by “; ”, as follows:

国,コク・くに,country
高,コウ・たか.い・たか・~だか・たか.まる・たか.める,tall; high; expensive
今,コン・キン・いま,now
東,トウ・ひがし,east
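To make the row format concrete, here is a minimal Python sketch that splits rows like the ones above into a kanji, a list of readings, and a list of meanings. It is only an illustration of the format: the function name `parse_row` and the inline `sample` string are made up for this example and are not part of our project.

```python
import csv
import io


def parse_row(row):
    """Split one CSV row into (kanji, readings, meanings)."""
    kanji, readings, meanings = row[0], row[1], row[2]
    # Readings are separated by the middle dot, meanings by "; ".
    return kanji, readings.split("・"), meanings.split("; ")


# Two rows copied from the sample above, inlined so the sketch runs offline.
sample = (
    "国,コク・くに,country\n"
    "高,コウ・たか.い・たか・~だか・たか.まる・たか.める,tall; high; expensive\n"
)

parsed = [parse_row(row) for row in csv.reader(io.StringIO(sample))]

for kanji, readings, meanings in parsed:
    print(kanji, readings, meanings)
```

Each reading and each meaning becomes its own value, which is the same split the import query performs before creating nodes.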

The JLPT has five levels (N1, N2, N3, N4 and N5), with N1 being the most difficult and N5 the easiest, so we also added Level nodes to represent this difficulty. We imported the data by running the following Cypher query from the Neo4j Browser.

UNWIND ["5", "4", "3", "2", "1"] AS level
LOAD CSV FROM "https://raw.githubusercontent.com/jimmycrequer/roth-2019/master/neo4j/data/vocabulary_6501" + level + ".csv" AS row
MERGE (k:Kanji {value: row[0]})

WITH row, k, level
MERGE (l:Level {value: "N" + level})
WITH row, k, l
MERGE (k)-[:HAS_LEVEL]->(l)

WITH row, k
UNWIND split(row[1], "・") AS reading
MERGE (r:Reading {value: reading})
WITH row, k, r
MERGE (k)-[:HAS_READING]->(r)

WITH row, k
UNWIND split(row[2], "; ") AS meaning
MERGE (m:Meaning {value: meaning})
WITH row, k, m
MERGE (k)-[:HAS_MEANING]->(m)

Some Japanese characters share readings and meanings, so we already had some pretty cool relationships at this stage. We decided to enrich our dataset further by adding radical information, creating more relationships between characters whose shapes are composed of the same radicals. We were lucky to find a dataset that fit our needs perfectly, with the following structure:

radical,meaning,kanjiList
｜,stick,亜唖逢悪以伊井稲印引....
丶,dot,以浦永泳詠往欧殴鴎蒲釜....

Each character in the “kanjiList” field must be treated independently. Since Neo4j has a flexible schema, it is very easy to add nodes and relationships on the fly! The key for us when importing this dataset was to use the “MATCH” keyword instead of the “MERGE” keyword to look up kanji: we wanted to add radicals only for kanji that already existed in our graph and ignore the others (this second dataset had far more characters than our first one).

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/jimmycrequer/roth-2019/master/neo4j/data/radicals.csv" AS row
UNWIND split(row.kanjiList, "") AS kanji
MATCH (k:Kanji {value: kanji})
WITH row, k
MERGE (r:Radical {value: row.radical})
WITH k, r
MERGE (k)-[:HAS_RADICAL]->(r)
MERGE (k)-[:HAS_RADICAL]->(r)
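The "only link kanji we already know" idea can be sketched outside the database as well. The Python snippet below is a rough illustration, not our import code: `radical_links` and `known_kanji` are hypothetical names, the set of known kanji is made up, and the rows use only the visible prefix of the truncated sample above.

```python
def radical_links(radical_rows, known_kanji):
    """Return (kanji, radical) pairs, skipping kanji we have not imported."""
    links = []
    for radical, _meaning, kanji_list in radical_rows:
        # Iterating over a string yields one character at a time,
        # like splitting the kanjiList field on the empty string.
        for kanji in kanji_list:
            # Look-up only: unknown kanji are ignored, never created.
            if kanji in known_kanji:
                links.append((kanji, radical))
    return links


# Hypothetical subset of the kanji imported from the first dataset.
known_kanji = {"以", "引", "永"}

# Visible prefix of the sample rows; the real lists are much longer.
radical_rows = [
    ("｜", "stick", "亜唖逢悪以伊井稲印引"),
    ("丶", "dot", "以浦永泳詠往欧殴鴎蒲釜"),
]

links = radical_links(radical_rows, known_kanji)
for kanji, radical in links:
    print(kanji, "HAS_RADICAL", radical)
```

Characters that are not in `known_kanji` simply produce no pair, which mirrors how a failed MATCH drops the row instead of creating a new node.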

Here is what our graph looked like at this point.

Number of nodes and relationships per label

Compute similarity using Jaccard algorithm

The next step for us was to actually use the relationships in our graph to find similar kanji. To compute this, we went with the Jaccard similarity algorithm. Neo4j provides implementations of many graph algorithms, and they are pretty straightforward to use.

MATCH (k:Kanji)-[]->(n)
WITH {item: id(k), categories: collect(id(n))} AS userData
WITH collect(userData) AS data
CALL algo.similarity.jaccard(data, {topK: 50, similarityCutoff: 0.1, write:true, writeProperty: "jaccardSimilarity"})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty
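For readers unfamiliar with the measure: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The Python sketch below applies that definition to a tiny, made-up neighbourhood map and mimics the `topK` and `similarityCutoff` options from the query. The helper names and the sample data are assumptions for illustration, and the library's internals may differ in detail.

```python
def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(neighbours, top_k=50, cutoff=0.1):
    """For each item, keep the top_k most similar items at or above cutoff."""
    result = {}
    for item, cats in neighbours.items():
        scored = [
            (other, jaccard(cats, other_cats))
            for other, other_cats in neighbours.items()
            if other != item
        ]
        kept = [(other, score) for other, score in scored if score >= cutoff]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        result[item] = kept[:top_k]
    return result


# Hypothetical neighbourhoods: the readings, meanings and level of each kanji.
neighbours = {
    "国": {"コク", "くに", "country", "N5"},
    "高": {"コウ", "たか", "tall", "high", "expensive", "N5"},
    "東": {"トウ", "ひがし", "east", "N5"},
}

pairs = similar_pairs(neighbours)
for kanji, similar in pairs.items():
    print(kanji, similar)
```

In the real graph, the sets also contain radicals, so kanji that look alike end up with higher similarity than in this toy example.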

This algorithm created a new relationship called “SIMILAR” between our nodes labelled Kanji. Let’s look into some results.

MATCH (k1:Kanji)-[r:SIMILAR]->(k2:Kanji)
WITH k1, r, k2
ORDER BY r.jaccardSimilarity DESC
RETURN k1.value AS kanji, collect(k2.value)[0..5] AS similarKanjis
LIMIT 6

Compute score using PageRank algorithm

Our ultimate goal was to create a quiz app, and we wanted a points system that rewards users when they answer correctly. To implement that, we decided to assign a score to every kanji using the PageRank algorithm.

Our reasoning was:

N1 kanji are harder than N5 kanji, and there are more of them
Kanji sharing the same meaning or reading are harder to get right
Kanji sharing the same radicals are harder to differentiate

In other words, “similar” kanjis can be considered harder. Our first try was to use the “SIMILAR” relationships computed in the previous step.

CALL algo.pageRank('Kanji', 'SIMILAR', {iterations:20, dampingFactor:0.85, weightProperty: "jaccardSimilarity"})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
RETURN nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty

Then we set a new score property based on the “pagerank” property written by the algorithm:

MATCH (k:Kanji)
SET k.score = round(k.pagerank * 100)

We were very pleased with the results: difficult kanji received higher scores than simple ones.

Most difficult and easiest kanji after running the PageRank algorithm
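To give an intuition for why well-connected kanji end up with higher scores, here is a small weighted PageRank written as a plain power iteration in Python. It is a sketch under assumptions: the edge weights are invented, the function name is hypothetical, and the normalisation used here (ranks summing to 1) is the textbook one, which may not match the library's output scale.

```python
def weighted_pagerank(edges, iterations=20, damping=0.85):
    """edges maps source -> {target: weight}; returns node -> rank."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives a small base amount of rank.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source, targets in edges.items():
            total = sum(targets.values())
            if total == 0:
                continue  # nothing to distribute from this node
            for target, weight in targets.items():
                # A node passes on its rank in proportion to edge weight.
                new_rank[target] += damping * rank[source] * weight / total
        rank = new_rank
    return rank


# Hypothetical SIMILAR edges, weighted by a made-up Jaccard similarity.
edges = {
    "日": {"目": 0.5, "白": 0.2},
    "目": {"日": 0.5},
    "白": {"日": 0.2},
}

ranks = weighted_pagerank(edges)
# Same idea as the score property: scale the rank and round it.
scores = {kanji: round(value * 100) for kanji, value in ranks.items()}
print(scores)
```

In this toy graph, the kanji with the most (and strongest) similarity links collects the largest share of rank, which is the behaviour we relied on to treat "similar" kanji as harder.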

That’s it for the graph building! I am sure we could have improved it by tuning some algorithm parameters and/or adding more data sources, but we were starting to run out of time, so we moved on to the next step: building the API!

Build the API

We wanted to create an API to fetch new questions for the quiz. Our first goal was a single endpoint that returns, for a given kanji, one of its meanings (the correct answer) and three other meanings (the wrong answers).

Finding a kanji’s meanings is pretty straightforward, as we just need to follow the “HAS_MEANING” relationship.

Finding wrong meanings is where we were really able to make use of our graph. Existing kanji quiz apps seem to just pick random meanings as wrong options, which makes it relatively easy to spot the correct answer. We wanted our app to be more difficult. Our idea was to retrieve the meanings of kanji that are similar to the one being guessed, using the “SIMILAR” relationship computed by the Jaccard algorithm. Similar kanji might have similar meanings, so confusion can certainly happen.

Lastly, we added some randomness to get different output for the same kanji. Here is what our schema looked like.

type Kanji {
  id: ID!
  value: String
  score: Int

  randomConnectedMeanings: [Meaning]
    @cypher(
      statement: """
        MATCH (this)-[:HAS_MEANING]->(m:Meaning)
        WITH m, rand() AS rand
        WITH m
        ORDER BY rand
        RETURN DISTINCT m
      """
    )

  randomNotConnectedMeanings: [Meaning]
    @cypher(
      statement: """
        MATCH (this)-[:SIMILAR]-(:Kanji)-[:HAS_MEANING]->(m:Meaning)
        WHERE NOT (this)-[:HAS_MEANING]->(m)
        WITH m, rand() AS rand
        WITH m
        ORDER BY rand
        RETURN DISTINCT m
      """
    )
}

type Meaning {
  id: ID!
  id: ID!
  value: String
}
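As an illustration of how the two fields could be combined into a question, here is a Python sketch that picks one correct meaning and up to three wrong ones, then shuffles them. This is not the code from our app: `build_question` is a hypothetical helper, and the sample meanings are made up to stand in for the values the `randomConnectedMeanings` and `randomNotConnectedMeanings` fields would return.

```python
import random


def build_question(kanji, own_meanings, similar_meanings, wrong_count=3, rng=random):
    """Build one multiple-choice question for a kanji."""
    correct = rng.choice(own_meanings)
    # Drop duplicates and anything that is actually a valid meaning.
    candidates = [
        meaning
        for meaning in dict.fromkeys(similar_meanings)
        if meaning not in own_meanings
    ]
    wrong = rng.sample(candidates, min(wrong_count, len(candidates)))
    choices = [correct] + wrong
    rng.shuffle(choices)
    return {"kanji": kanji, "choices": choices, "answer": correct}


# Hypothetical values standing in for the two GraphQL fields.
own_meanings = ["tall", "high", "expensive"]
similar_meanings = ["high", "cheap", "long", "big", "wide", "cheap"]

question = build_question("高", own_meanings, similar_meanings, rng=random.Random(0))
print(question)
```

Filtering out the kanji's own meanings plays the same role as the `WHERE NOT (this)-[:HAS_MEANING]->(m)` clause in the schema: a wrong option should never be an accepted answer.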