Answering English questions using knowledge graphs and sequence translation

In sci-fi films like Star Trek and 2001: A Space Odyssey, crew-members walk around asking their ship questions about things in its encyclopedic database:

Crusher: Computer, what is the nature of the universe?
Computer: The universe is a spheroid region, 705m in diameter.
—Remember Me, Star Trek: The Next Generation

In this article, I’m going to show you how to create a basic version of this, using the magic of sequence-to-sequence translation. And yes, if you add the nature of the universe to your database, you can get some answers about it.

We’re going to create a system that is able to take an English language question, convert it into Cypher using a neural network, then run that query against a Neo4j graph database to produce an answer:

Here’s the system in action (it’s been trained on a database of transit networks and questions about them):

Thanks to the power of Cypher and the Neo4j database engine, our simple system is able to perform reasoning tasks such as:

  • Recalling properties of objects
  • Counting
  • Finding shortest paths
  • Determining if two objects have a relationship

The code for this article can be found in our GitHub and the training of the model and dataset is available as a FloydHub project.

Limitations

Whilst the system we’re building is quite impressive, it’s subject to various limitations:

  • It struggles with questions that are unlike what it’s seen before
  • It can generate Cypher queries that are poorly formed and will not execute

In summary, it has quite a shallow understanding of English, Cypher and their translation together. These limitations are overcomeable given more training data and deeper, more sophisticated neural networks, although we will not cover how to implement that here.

The technologies we’ll use

In case you’re not familiar with the technologies we’re employing, I’ll give a brief introduction to them here.

Knowledge graphs

A knowledge graph, simply put, is a store of information with entities and relationships between them. It’s how we naturally draw out families:

Using #neo4j graph database for my [extended] family tree

Graph structure, where each entity can have few or many relationships to any other other entity, is highly flexible and can represent many things in our world (e.g. transit networks, social networks, servers, Wikipedia).

The term “Knowledge Graph” was made popular when Google built their own, now storing over 70 billion facts. They use it to help answer Google searches and “Ok Google” questions.

One of the great benefits of knowledge graphs is that they can store many types of information. Therefore it’s easy to extend a knowledge graph with new knowledge by adding more entities and adding connections between existing and new entities.

At Octavian, we believe that knowledge graphs are an important part of making systems with human-level reasoning.

Cypher and the Neo4j graph database

To store and analyse the knowledge graph, we’ll use Neo4j. Neo4j is a popular, fast and free-to-use graph database.

To interact with Neo4j, one uses a query language called Cypher. Cypher is powerful and flexible, it can be used to do many things including:

  • Extract sub-graphs
  • Create tables of information from the graph entities and relations
  • Run algorithms like Shortest Path and PageRank on the graph
  • Count, filter and aggregate data

Our system will translate English into Cypher and then leverage the Neo4j graph database engine to do the heavy-lifting.

To give you an idea of how Cypher works, here’s an example Cypher query for an imagined database of product reviews:

MATCH g=(:PERSON) -[:WROTE]-> (:REVIEW) -[:OF]-> (:PRODUCT) 
RETURN g LIMIT 1

This looks for a node, of label PERSON, with a relationship of label WROTE, to a node of label REVIEW, with a relationship of label OF to a node of label PRODUCT. It returns a whole sub-graph of those three nodes and two relations. The qualifier “LIMIT 1” asks the database to just return one sub-graph that matches this pattern. Here’s what the result looks like in Neo4j Desktop:

TensorFlow

For the machine learning part of this system, we’ll use TensorFlow. TensorFlow is primarily a model building and training framework, letting us express our model and train it on our data. TensorFlow has quickly become one of the most popular and actively developed machine learning libraries.

TensorFlow can save a lot of development time. Once a model has been built in TensorFlow, the same code can be used for experimentation, development and deployment to production. There are many platforms like FloydHub, Google CloudML and AWS SageMaker that help with training and deploying TensorFlow models.

Sequence-to-Sequence translation

Sequence to sequence (seq2seq) translation is a powerful neural network technology, and is what powers Google Translate.

Seq2seq models typically employ two Recurring Neural Networks (RNNs), a special type of network that processes sequences. One RNN encodes the input (our English question) into data and a second RNN decodes that data into the output (our Cypher query). We’ll explain this arrangement in more detail later on.

The dataset we’ll use: CLEVR-graph

CLEVR-graph contains a set of questions and answers about procedurally generated transport network graphs. Here’s what one of its transport networks looks like (it’s vaguely modeled on the London Tube) and some example questions and answers:

Learn more about the dataset

Every question in CLEVR-graph comes with an equivalent Cypher statement and a unique, procedurally generated graph. This is ideal as we can directly train the model to translate the English questions into the Cypher queries.

For this model I used the gqa.generate script from CLEVR-graph to generate 670,000 Question-Cypher pairs, which I then extracted into all-src.txt and all-tgt.txt in the /data/ directory of the codebase using the e2c.build_data script.

CLEVR-graph contains 18 different types of questions with Cypher translations, and each Question-Answer-Graph triple has unique, randomly generated station and line names. The English questions are generated from templates like how many stations are between <station> and <station>?. There are only 18 of these templates so the training data doesn’t have the variation in language that you’d get in a more natural (human) source.

An overview of the system we’ll build

We take input from the user (as an English question), format it for the model, pass it through the seq2seq translation, format the model output into a cypher string, query that against Neo4j, then finally have an answer to the initial question.

Here’s an overview of that process:

The system achieves a test accuracy of 100% in translating the given English-Cypher pairs.

Translating English into Cypher

This is the most complicated piece of the system. It’s a Sequence to Sequence (seq2seq) model with some optimization for our specific task.

I’ll explain the different parts of the model and link to further explanations — it’s hard to fully explain and do justice to the components in just one article.

This sequence to sequence translation code is heavily based on the wonderful TensorFlow neural machine translation (NMT) codebase and tutorial, with some small improvements. I’ve stripped the NMT codebase down heavily to just the pieces needed for this task, making the code much simpler to read. The original seven model code files are now one single 360 line file.

Overview

The first part of almost any text processing system is pre-processing the input. We perform a range of operations to make it easier for the neural network to operate on the text, specifically:

The code for this process is in build_data.py in the accompanying GitHub repository.

Tokenizing the text is necessary as our sequence encoder will need a list of values to operate upon. I adopted a mixed tokenization scheme were common words are replaced with a token, and rare words are replaced with their spelling in character tokens. This was crucial to getting high test accuracy.

This mixed tokenization scheme balances different tradeoffs. Using word tokens for common words keeps the vocabulary list small (minimizing the complexity of the network) and reduces the work the network has to do to copy common words. Using character tokens for rare and unknown words provides a way for the network to handle rare/previously unseen words (e.g. station names).

Once tokenized into a list of integers, the tokens are embedded. This maps each integer to a specific vector for that integer. This mapping (‘the embedding’) is learnt during training. It enables the network to learn a useful representation of tokens (for example, tokens with similar meaning may end up with a similar embeddings)

Recurring neural networks

A Recurring Neural Network (RNN) is a network that is fed a sequence one piece at a time, and can pass forward an internal state to the next iteration:

This structure is great for sequences as it can consume them one piece at a time, for any length of sequence. Thanks to the state passed forward (the horizontal arrows above) the network can remember things it saw earlier in the sequence to help it understand later parts. They are surprisingly powerful.

Colah’s post on understanding neural networks is a great resource for understanding these networks, and worth reading now if you’re unfamiliar with them.

We use a special type of cell called a Long Short Term Memory (LSTM) cell. It has special gates to help it decide what information to remember and pass forward, and what information to forget. These gates make it much better at passing information forward for a long time (E.g. our sequences can be up to 300 tokens long).

By running the LSTM cell on the input question, we can “encode” it into data that can be used to generate the translation output. This encoding consists of the cell’s final passed forward state (the horizontal lines) and the output of the cell on each earlier token (the circles with h’s in them).

In a similar fashion, we have a second cell “decode” those states into Cypher. These cells take the previous outputted token as their input. The whole arrangement looks like this:

Remarkably this encoder-cell, decoder-cell arrangement, with some additional complexity, is able to perform very sophisticated translations.

You can see the whole model here, it’s only 360 lines including evaluation metrics and TensorFlow boilerplate.

In addition to what I’ve shown above, I stack two of these cells on each-other, use a bidirectional encoder, attention and also beam search decoding — these techniques all help increase the accuracy.

Putting it all together

The model takes roughly two hours to train on a Tesla K80. The RNN is resource hungry as it is a complex network that has to be semi-sequentially run for each token of each sequence batch. I trained the network on FloydHub, you can see the project and its output.

To train the network locally, run these commands in the english2cypher directory:

pipenv install
pipenv shell
python -m e2c.build_data
python -m e2c.train

Or alternatively train it on FloydHub (the dataset is already uploaded there):

./floyd-train.sh

As it trains it outputs predictions, which improve over time:

# 1 training step
<EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS>
# 20 training steps
MATCH MATCH MATCH ( MATCH var1 MATCH ]
# 166 training steps
MATCH ()-[var1]-()  MATCH (var2:LINE) WHERE var2.name="Red Crend"  WITH 1 AS foo, var1, var2.id AS var3 WHERE var1.line_id = var3  MATCH (var4)-[var1]-() WHERE var4.architecture = "conerete"  WITH 1 AS foo, var4 AS var5 WITH DISTINCT var5 as var6, 1 AS foo  RETURN length(collect(var6))

The test accuracy increases over time and the loss decreases:

I measure a range of evaluation metrics. eval_loss is a continuous measure of the output’s error, eval_guided_accuracy is how well the model predicts the next output token, given the correct previous one. eval_beam_mode_accuracy is whether the mode of the beam search predictions is correct — this corresponds exactly to how well the model will perform in deployment

Once you have a trained model (or just use ours, it will be automatically downloaded) you can try it out with python -m e2c.predict:

Next steps

The system achieves quite impressive things given it’s a very shallow model of the task trained on a simple, machine generated training set. There are a number of ways we could improve its capabilities:

  • Provide a broader range of questions
  • Provide more English variations in the training data
  • Provide more Cypher variations in the training data
  • Use the Cypher grammer to provide a training loss signal about which pieces of the output are mis-formed
  • Train encoder and decoder networks independently on corpuses of English and Cypher resprctively to learn their grammar.

It’s likely that the network would need to be made deeper to handle a wider range of English and Cypher expressions. For reference, our network uses two LSTM layers, and Google’s translation network uses eight, as well as other sophistications.

Further to this, you could build a network that skips Cypher all-together and directly operates upon the graph. The potential benefits of this is that it’s not limited by what is hard to do in Cypher, there may be smoother progression in parameter space between algorithms than there is between equivalent Cypher statements and it does not require manual provision of Cypher training data. This is an open area of research for Octavian and an architecture we plan to publish soon.

Octavian’s research

Octavian’s mission is to develop systems with human-level reasoning capabilities. We believe that graph data and deep learning are key ingredients to making this possible. If you interested in learning more about our research or contributing, get in touch.

Acknowledgements

Thanks to Andrew Jefferson for proofing and editing this article. Thanks to FloydHub for supplying some of the GPU time used for the training.