Learn Japanese characters using Neo4j
Building a kanjis quiz app with GraphQL, React, and a Graph Database in 2 days
TL;DR
In just two days, we built a fully working kanjis quiz app with the GRANDstack by going through the following steps:
- Import some CSV datasets to create our graph database.
- Run the Jaccard algorithm to create more relationships between our nodes.
- Run the PageRank algorithm to compute an additional property on some nodes.
- Create APIs using GraphQL to generate random quiz questions.
You can find the source code in the GitHub repository.
Last week, I joined a 2-day hackathon held at my company. The theme was “New, Fun, Speed”, and our team set out to build a small quiz app for learning Japanese characters in a new and faster way, while having fun. My company is located in Japan, and we thought our fellow foreign colleagues could benefit from this idea.
We decided to use the GRANDstack to build our app:
- GraphQL for the API endpoints
- React for the frontend
- Apollo for facilitating communication between API and frontend
- Neo4j as our database technology to store the Japanese characters and build connections between them
Let’s get into it!
Build the graph
Import the datasets
We found an open Japanese-Language Proficiency Test (JLPT) dataset in a handy CSV format. Each row contains a kanji, its readings separated by “・”, and its meanings separated by “; ”, as follows:
国,コク・くに,country
高,コウ・たか.い・たか・~だか・たか.まる・たか.める,tall; high; expensive
今,コン・キン・いま,now
東,トウ・ひがし,east
We were able to import this data using the following Cypher query from the Neo4j Browser. In addition, the JLPT has five levels: N1, N2, N3, N4 and N5, with N1 being the most difficult and N5 the easiest, so we decided to also add Level nodes to represent this difficulty.
UNWIND ["5", "4", "3", "2", "1"] AS level
LOAD CSV FROM "https://raw.githubusercontent.com/jimmycrequer/roth-2019/master/neo4j/data/vocabulary_6501" + level + ".csv" AS row
MERGE (k:Kanji {value: row[0]})
WITH row, k, level
MERGE (l:Level {value: "N" + level})
WITH row, k, l
MERGE (k)-[:HAS_LEVEL]->(l)
WITH row, k
UNWIND split(row[1], "・") AS reading
MERGE (r:Reading {value: reading})
WITH row, k, r
MERGE (k)-[:HAS_READING]->(r)
WITH row, k
UNWIND split(row[2], "; ") AS meaning
MERGE (m:Meaning {value: meaning})
WITH row, k, m
MERGE (k)-[:HAS_MEANING]->(m)
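To make the splitting logic of the import query easier to follow, here is a minimal Python sketch of what happens to each CSV row before it becomes nodes and relationships. It is only an illustration: the names `parse_row` and `parse_csv` are made up for this example, and the sample text reuses two rows shown above.

```python
import csv
import io

# Two rows copied from the dataset excerpt above.
SAMPLE = (
    "国,コク・くに,country\n"
    "高,コウ・たか.い・たか・~だか・たか.まる・たか.める,tall; high; expensive\n"
)


def parse_row(row):
    """Split one CSV row into (kanji, readings, meanings)."""
    kanji, readings, meanings = row[0], row[1], row[2]
    # Same separators as the split() calls in the Cypher query.
    return kanji, readings.split("・"), meanings.split("; ")


def parse_csv(text):
    """Parse the whole CSV text, skipping empty lines."""
    return [parse_row(row) for row in csv.reader(io.StringIO(text)) if row]


rows = parse_csv(SAMPLE)
for kanji, readings, meanings in rows:
    print(kanji, readings, meanings)
```

Each element of `readings` and `meanings` corresponds to one Reading or Meaning node that the query merges and links to the Kanji node.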
Some Japanese characters share readings and meanings, so we already had some pretty cool relationships at this stage. We decided to enrich our dataset further by adding radical information, which creates more relationships between characters whose shapes are composed of the same radicals. We were lucky to find a dataset that fit our needs perfectly, with the following structure:
radical,meaning,kanjiList
|,stick,亜唖逢悪以伊井稲印引....
丶,dot,以浦永泳詠往欧殴鴎蒲釜....
Each character in the “kanjiList” field must be treated independently. Since Neo4j allows a flexible schema, it is very easy to add nodes and relationships on the fly! The key for us when importing this dataset was to use the “MATCH” keyword instead of the “MERGE” keyword for the kanji, because we wanted to add radicals only for kanjis that already existed in our dataset and ignore the others (this second dataset had far more characters than our first one).
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/jimmycrequer/roth-2019/master/neo4j/data/radicals.csv" AS row
UNWIND split(row.kanjiList, "") AS kanji
MATCH (k:Kanji {value: kanji})
WITH row, k
MERGE (r:Radical {value: row.radical})
WITH k, r
MERGE (k)-[:HAS_RADICAL]->(r)
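The difference between “MATCH” and “MERGE” here can be sketched in a few lines of Python. This is an illustration only: `attach_radicals` is a hypothetical helper, and `known` is a made-up set standing in for the kanji already imported from the first dataset.

```python
def attach_radicals(known_kanji, radical_rows):
    """Return (kanji, radical) pairs only for kanji already in the graph."""
    pairs = []
    for radical, _meaning, kanji_list in radical_rows:
        # Iterating over the string mirrors split(row.kanjiList, "").
        for kanji in kanji_list:
            # MATCH semantics: unknown kanji are silently skipped.
            # MERGE would have created a new Kanji node instead.
            if kanji in known_kanji:
                pairs.append((kanji, radical))
    return pairs


# Hypothetical subset of kanji already present in the graph.
known = {"以", "永", "引"}
radical_rows = [
    ("|", "stick", "亜唖逢悪以伊井稲印引"),
    ("丶", "dot", "以浦永泳詠往欧殴鴎蒲釜"),
]
pairs = attach_radicals(known, radical_rows)
print(pairs)
```

Only the pairs returned here would end up as HAS_RADICAL relationships; every other character in the radical dataset is ignored.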
Here is what our graph looked like at this point.
Compute similarity using Jaccard algorithm
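The next step for us was to actually use the relationships in our graph to find similar kanjis. To compute this, we decided to go with the Jaccard algorithm, which scores two kanjis by how many neighbours (readings, meanings, radicals, level) they share relative to all the neighbours they have between them. Neo4j provides implementations of many graph algorithms, and they are pretty straightforward to use.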
MATCH (k:Kanji)-[]->(n)
WITH {item: id(k), categories: collect(id(n))} AS userData
WITH collect(userData) AS data
CALL algo.similarity.jaccard(data, {topK: 50, similarityCutoff: 0.1, write:true, writeProperty: "jaccardSimilarity"})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty
This algorithm created a new relationship called “SIMILAR” between our nodes labelled Kanji. Let’s look into some results.
MATCH (k1:Kanji)-[r:SIMILAR]->(k2:Kanji)
WITH k1, r, k2
ORDER BY r.jaccardSimilarity DESC
RETURN k1.value AS kanji, collect(k2.value)[0..5] AS similarKanjis
LIMIT 6
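For readers who have not met the Jaccard index before, here is a small Python sketch of the idea: the similarity of two kanjis is the size of the intersection of their neighbour sets divided by the size of the union. The neighbour sets below are toy values invented for the example, and `similar_pairs` is a hypothetical helper; it is not how Neo4j implements the procedure, and it ignores the topK option.

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_pairs(neighbours, cutoff=0.1):
    """Score every pair of items and keep those at or above the cutoff."""
    items = sorted(neighbours)
    result = {}
    for i, x in enumerate(items):
        for y in items[i + 1:]:
            score = jaccard(neighbours[x], neighbours[y])
            if score >= cutoff:
                result[(x, y)] = score
    return result


# Toy neighbour sets (reading, meaning, radical, level), not real data.
neighbours = {
    "A": {"r1", "m1", "rad1", "N5"},
    "B": {"r1", "m2", "rad1", "N5"},
    "C": {"r9", "m9", "rad9", "N1"},
}
print(similar_pairs(neighbours))  # {('A', 'B'): 0.6}
```

A and B share three of their five distinct neighbours, hence 0.6, while C shares nothing with either and falls below the cutoff, just as the similarityCutoff parameter filters out weak pairs in the query above.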
Compute score using PageRank algorithm
Our ultimate goal was to create a quiz app, and we wanted a points system to reward our users when they answer correctly. To implement that, we decided to assign a score to every kanji using the PageRank algorithm.
Our reasoning was:
- N1 kanjis are harder than N5 kanjis, and there are more of them
- Kanjis sharing the same meaning or reading are harder to get right
- Kanjis sharing the same radicals are harder to differentiate
In other words, “similar” kanjis can be considered harder. Our first try was to use the “SIMILAR” relationships computed in the previous step.
CALL algo.pageRank('Kanji', 'SIMILAR', {iterations:20, dampingFactor:0.85, weightProperty: "jaccardSimilarity"})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
RETURN nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
Then we set a new score property from the “pagerank” property written by the algorithm:
MATCH (k:Kanji)
SET k.score = round(k.pagerank * 100)
And we were very pleased with the results. Difficult kanjis were attributed a higher score than simple ones.
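To give an intuition for why well-connected kanjis end up with higher scores, here is a simplified, textbook-style weighted PageRank in Python. It is a sketch under assumptions: the graph is a toy example, the function name and edge format are invented for this post, and Neo4j's procedure differs in details such as normalization, so the absolute numbers will not match.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Weighted PageRank by power iteration.

    edges maps (source, target) to a non-negative weight.
    Returns a dict mapping each node to its score (scores sum to 1).
    """
    nodes = sorted({node for edge in edges for node in edge})
    count = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    scores = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / count for node in nodes}
        # Each node passes its score along outgoing edges,
        # proportionally to the edge weight.
        for (src, dst), weight in edges.items():
            if out_weight[src] > 0:
                new[dst] += damping * scores[src] * weight / out_weight[src]
        # Nodes without outgoing weight spread their score evenly.
        dangling = sum(scores[n] for n in nodes if out_weight[n] == 0)
        for node in nodes:
            new[node] += damping * dangling / count
        scores = new
    return scores


# Toy SIMILAR edges weighted by a made-up Jaccard similarity.
edges = {("a", "b"): 1.0, ("b", "a"): 1.0, ("c", "a"): 0.5}
scores = pagerank(edges)
points = {node: round(score * 100) for node, score in scores.items()}
print(points)
```

The last step mirrors the `round(k.pagerank * 100)` expression above: node "c", which no other node points to, keeps only the baseline score, while "a" and "b" accumulate score from their incoming edges.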
That’s it for the graph building! I am sure we could have improved it by tuning some algorithm parameters and/or adding more data sources, but we were starting to run out of time and moved on to the next step: building the API!
Build the API
We wanted to create an API to fetch new questions for the quiz. At first, we wanted a single endpoint that returns, for a given kanji, one of its meanings (the correct answer) and three different meanings (the wrong answers).
Finding a kanji’s meanings is pretty straightforward, as we just need to follow the “HAS_MEANING” relationship.
Finding wrong meanings is really where we were able to make use of our graph. Existing kanji quiz apps seem to just pick random meanings as wrong options, which makes it relatively easy to spot the correct answer. We wanted our app to be more difficult. Our idea was to retrieve the meanings of kanjis that are similar to the one we are trying to guess, using the “SIMILAR” relationship computed by the Jaccard algorithm. Similar kanjis might have similar meanings, so confusion can certainly happen.
Lastly, we added some randomness to get a different output for the same kanji. Here is what our schema looked like.
type Kanji {
id: ID!
value: String
score: Int
randomConnectedMeanings: [Meaning]
@cypher(
statement: """
MATCH (this)-[:HAS_MEANING]->(m:Meaning)
WITH m, rand() AS rand
WITH m
ORDER BY rand
RETURN DISTINCT m
"""
)
randomNotConnectedMeanings: [Meaning]
@cypher(
statement: """
MATCH (this)-[:SIMILAR]-(:Kanji)-[:HAS_MEANING]->(m:Meaning)
WHERE NOT (this)-[:HAS_MEANING]->(m)
WITH m, rand() AS rand
WITH m
ORDER BY rand
RETURN DISTINCT m
"""
)
}
type Meaning {
id: ID!
value: String
}
Usage example from the GraphQL playground.
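The way the two fields combine into a quiz question can be sketched in Python as follows. This is not the code from our app: `build_question` is a hypothetical helper, and the `meanings` and `similar` dictionaries are placeholder data standing in for the HAS_MEANING and SIMILAR relationships.

```python
import random


def build_question(kanji, meanings, similar, rng=random, wrong_count=3):
    """Pick one correct meaning and wrong meanings taken from similar kanjis."""
    correct_pool = meanings.get(kanji, set())
    # Meanings of similar kanjis that are NOT meanings of this kanji,
    # like the WHERE NOT clause in randomNotConnectedMeanings.
    wrong_pool = sorted(
        {
            meaning
            for other in similar.get(kanji, ())
            for meaning in meanings.get(other, ())
            if meaning not in correct_pool
        }
    )
    if not correct_pool or len(wrong_pool) < wrong_count:
        return None  # not enough material for a question
    correct = rng.choice(sorted(correct_pool))
    choices = rng.sample(wrong_pool, wrong_count) + [correct]
    rng.shuffle(choices)
    return {"kanji": kanji, "choices": choices, "answer": correct}


# Placeholder data, not taken from the real graph.
meanings = {
    "k1": {"tall", "high"},
    "k2": {"high", "bridge", "pavilion"},
    "k3": {"capital"},
}
similar = {"k1": ["k2", "k3"]}

question = build_question("k1", meanings, similar, rng=random.Random(0))
print(question)
```

Note that “high” is never offered as a wrong answer for "k1", even though the similar kanji "k2" also has it, because it is one of the correct meanings.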
Following the same logic, we ended up adding more endpoints to diversify the types of quiz questions:
- For a kanji, guess the correct reading
- For a meaning, guess the correct kanji
- And some more…
Summary
In just two days, we built a fully working kanjis quiz app with the GRANDstack by going through the following steps:
- Import some CSV datasets to create our graph database.
- Run the Jaccard algorithm to create more relationships between our nodes.
- Run the PageRank algorithm to compute an additional property on some nodes.
- Create APIs using GraphQL to generate random quiz questions.
You can find the source code in the GitHub repository. Please don’t hesitate to reach out if you have any comments or ideas.