Learn Japanese characters using Neo4j

Building a kanji quiz app with GraphQL, React, and a graph database in 2 days

Jimmy Crequer
Neo4j Developer Blog
6 min read · Sep 20, 2019


TL;DR

In just two days, we built a fully working kanji quiz app with the GRANDstack by going through the following steps:

  • Import some CSV datasets to create our graph database.
  • Run the Jaccard algorithm to create more relationships between our nodes.
  • Run the PageRank algorithm to compute an additional property on some nodes.
  • Create APIs using GraphQL to generate random quiz questions.

You can find the source code in the GitHub repository.

Last week, I joined a 2-day hackathon held at my company. The theme was “New, Fun, Speed”, and our team aimed to build a small quiz app for learning Japanese characters in a new and faster way, while having fun. My company is located in Japan, and we thought our fellow foreign colleagues could benefit from this idea.

We decided to use the GRANDstack to build our app:

  • GraphQL for the API endpoints
  • React for the frontend
  • Apollo for facilitating communication between the API and the frontend
  • Neo4j as our database technology to store the Japanese characters and build connections between them

Let’s get into it!

Build the graph

Import the datasets

We found an open Japanese-Language Proficiency Test (JLPT) dataset in a handy CSV format. Each row holds a kanji, its readings separated by “・”, and its meanings separated by “; ”.
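To make the row format concrete, here is a minimal Python sketch of how the multi-valued fields could be split before creating nodes. The column names (`kanji`, `readings`, `meanings`) and the sample row are assumptions made for illustration, not the dataset's real header, and this is not the Cypher import we actually ran.

```python
import csv
import io

# Hypothetical sample in the shape described above. The column names are
# assumptions for this sketch, not the real header of the dataset.
SAMPLE_CSV = """kanji,readings,meanings
国,コク・くに,country; nation
"""


def parse_row(row):
    """Split the multi-valued fields of one CSV row into clean lists."""
    return {
        "kanji": row["kanji"],
        # Readings are separated by the "・" character.
        "readings": [r.strip() for r in row["readings"].split("・") if r.strip()],
        # Meanings are separated by "; " (we split on ";" and trim spaces).
        "meanings": [m.strip() for m in row["meanings"].split(";") if m.strip()],
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0])
```

Each element of `readings` and `meanings` would then become its own node connected to the kanji, which is what lets two kanjis share a reading or a meaning in the graph.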

We imported this data with a Cypher query run from the Neo4j Browser. The JLPT has five levels, N1 to N5, with N1 being the most difficult and N5 the easiest, so we also added Level nodes to represent this difficulty.

Some Japanese characters share readings and meanings, so we already had pretty cool relationships at this stage. We decided to enrich our dataset further by adding radical information and creating more relationships between characters whose shapes are composed of the same radicals. We were lucky to find a dataset that fit our needs perfectly.

Each character in the “kanjiList” field must be treated independently. Since Neo4j has a flexible schema, it is very easy to add nodes and relationships on the fly! The key for us when importing this dataset was to use the “MATCH” keyword instead of the “MERGE” keyword for kanjis: we wanted to add radicals only for kanjis that already existed in our dataset and ignore the others (this second dataset had way more characters than our first one).
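The effect of choosing “MATCH” over “MERGE” can be sketched in plain Python. This is only an in-memory illustration, not our import query: the sample rows, the `radical` field name, and the assumption that “kanjiList” is a plain string of characters are all hypothetical.

```python
# Kanjis already imported from the first dataset (hypothetical sample).
known_kanjis = {"国", "玉", "王"}

# Hypothetical rows of the radical dataset: one radical plus its "kanjiList",
# assumed here to be a string of characters with no separator.
radical_rows = [
    {"radical": "囗", "kanjiList": "国囚團"},
    {"radical": "玉", "kanjiList": "国玉"},
]


def link_radicals(rows, known):
    """Return (kanji, radical) pairs, skipping kanjis that are not in `known`."""
    pairs = []
    for row in rows:
        for kanji in row["kanjiList"]:  # each character is handled on its own
            if kanji in known:  # MATCH-like: unknown kanjis are ignored
                pairs.append((kanji, row["radical"]))
    return pairs


has_radical = link_radicals(radical_rows, known_kanjis)
print(has_radical)
```

A MERGE-like version would drop the `if kanji in known` check and create the missing kanjis instead, which is exactly what we wanted to avoid.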

Here is what our graph looked like at this point.

Number of nodes and relationships per label
Example of the kanji “country”

Compute similarity using Jaccard algorithm

The next step for us was to actually use the relationships of our graph to find similar kanjis. To compute this, we decided to go for the Jaccard algorithm. Neo4j provides implementations of many graph algorithms, and they are pretty straightforward to use.
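For readers new to it, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to the sets of nodes each kanji is connected to. It is a plain-Python illustration rather than the Neo4j procedure we called; the sample neighbour sets and the `cutoff` parameter are assumptions.

```python
from itertools import combinations

# Hypothetical neighbour sets: the radicals ("r:") and meanings ("m:")
# each kanji is connected to in the graph.
neighbours = {
    "大": {"r:大", "m:big"},
    "犬": {"r:大", "r:丶", "m:dog"},
    "太": {"r:大", "r:丶", "m:thick"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(neighbours, cutoff):
    """Return (kanji1, kanji2, score) for every pair scoring at least `cutoff`."""
    pairs = []
    for k1, k2 in combinations(sorted(neighbours), 2):
        score = jaccard(neighbours[k1], neighbours[k2])
        if score >= cutoff:
            pairs.append((k1, k2, score))
    return pairs


print(similar_pairs(neighbours, cutoff=0.5))
```

With this sample data only 太 and 犬 pass the cutoff, because they share two of their four distinct neighbours.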

This algorithm created a new relationship called “SIMILAR” between our nodes labelled Kanji. Let’s look into some results.

Kanjis look very similar!

Compute score using PageRank algorithm

Our ultimate goal was to create a quiz app, and we wanted to implement a points system to reward our users when they get a correct answer. To implement that, we decided to attribute a score to every kanji using the PageRank algorithm.

Our reasoning was:

  • N1 kanjis are harder than N5 kanjis, and there are more of them
  • Kanjis sharing the same meaning or reading are harder to get right
  • Kanjis sharing the same radicals are harder to differentiate

In other words, “similar” kanjis can be considered harder. Our first try was to use the “SIMILAR” relationships computed in the previous step.
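To show why well-connected kanjis end up with a higher rank, here is a minimal power-iteration sketch of PageRank. It is not the Neo4j implementation: treating “SIMILAR” as undirected, the `damping` and `iterations` defaults, and the sample edges are all assumptions, and the absolute numbers will not match what Neo4j returns.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over undirected edges (each edge is used both ways)."""
    out = {}
    for a, b in edges:
        out.setdefault(a, set()).add(b)
        out.setdefault(b, set()).add(a)
    n = len(out)
    rank = {node: 1.0 / n for node in out}
    for _ in range(iterations):
        # Every node gets a base share, then receives rank from its neighbours.
        new = {node: (1.0 - damping) / n for node in out}
        for node, targets in out.items():
            share = damping * rank[node] / len(targets)
            for target in targets:
                new[target] += share
        rank = new
    return rank


# Hypothetical SIMILAR relationships: 大 is similar to three other kanjis.
similar_edges = [("大", "犬"), ("大", "太"), ("大", "天")]
rank = pagerank(similar_edges)
print(sorted(rank.items(), key=lambda item: item[1], reverse=True))
```

In this toy graph 大 collects rank from all three neighbours, so it comes out on top, which matches the intuition that a kanji with many look-alikes is harder.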

Then we set a new score property on each kanji, derived from the new “pageRank” property.

And we were very pleased with the results. Difficult kanjis were attributed a higher score than simple ones.

Top most difficult and easiest kanjis after computing PageRank algorithm

That’s it for building the graph! I am sure we could have improved it by tuning some algorithm parameters or adding more data sources, but we were starting to run out of time, so we moved on to the next step: building the API!

Build the API

We wanted to create an API to fetch new questions for the quiz. At first, we wanted a single endpoint that returns, for a given kanji, one of its meanings (the correct answer) and three different meanings (the wrong answers).

Finding a kanji’s meanings is pretty straightforward, as we just need to follow the “HAS_MEANING” relationship.

Finding wrong meanings is really where we were able to make use of our graph. Existing kanji quiz apps seem to just take random meanings and use them as wrong options, which makes it relatively easy to pick the correct answer. We wanted our app to be more difficult. Our idea was to retrieve the meanings of kanjis that are similar to the one we are trying to guess, using the “SIMILAR” relationship computed by the Jaccard algorithm. Similar kanjis might have similar meanings, and confusion can certainly happen.
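The selection logic can be sketched in a few lines of Python. This only illustrates the idea, not our GraphQL resolver: the `build_question` helper, its parameters, and the sample data are hypothetical.

```python
import random


def build_question(kanji, meanings, similar, rng=random):
    """Pick one correct meaning and up to three wrong ones taken from similar kanjis."""
    own = meanings[kanji]
    correct = rng.choice(sorted(own))
    # Candidate wrong answers: meanings of similar kanjis, minus the kanji's own.
    candidates = sorted(
        {m for other in similar.get(kanji, ()) for m in meanings[other]} - own
    )
    wrong = rng.sample(candidates, k=min(3, len(candidates)))
    options = wrong + [correct]
    rng.shuffle(options)  # so the correct answer is not always last
    return {"kanji": kanji, "answer": correct, "options": options}


# Hypothetical data: meanings per kanji, and kanjis linked by SIMILAR.
meanings = {
    "犬": {"dog"},
    "大": {"big", "large"},
    "太": {"thick", "fat"},
    "天": {"heaven", "sky"},
}
similar = {"犬": ["大", "太", "天"]}

question = build_question("犬", meanings, similar, rng=random.Random(0))
print(question)
```

Passing a seeded `random.Random` keeps the sketch reproducible, while the default module-level generator gives a different question on each call.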

Lastly, we added some randomness to get a different output for the same kanji, and here is what our schema looked like.

Usage example from the GraphQL playground.

Example of quiz question

With the same logic, we ended up adding more endpoints to diversify the types of quiz questions:

  • For a kanji, guess the correct reading
  • For a meaning, guess the correct kanji
  • And some more…

Summary

In just two days, we built a fully working kanji quiz app with the GRANDstack by going through the following steps:

  • Import some CSV datasets to create our graph database.
  • Run the Jaccard algorithm to create more relationships between our nodes.
  • Run the PageRank algorithm to compute an additional property on some nodes.
  • Create APIs using GraphQL to generate random quiz questions.

You can find the source code in the GitHub repository. Please don’t hesitate to reach out if you have any comments or ideas.
