Graphs and neural networks: Reading node properties

David Mack
Octavian
12 min read · Sep 29, 2018

We often have questions like “What time is the meeting with Susan?” or “When does Walmart open?”. These questions ask for a property of an object in our world. In this article, I’ll show you how to answer these sorts of questions with a knowledge graph and a neural network.

It’s quite easy to do this using existing databases. For simplicity, we’ll assume that the question unambiguously¹ refers to a node in our graph. In a Neo4j graph database, one could simply query:

MATCH (a:SHOP {name: "Walmart"}) RETURN a.opening_time

In this article we will take a pure neural-network approach. This enables us to do some things a traditional database cannot, but it requires us to write more code and solve some interesting challenges.

All the code for this article is available in our GitHub repository.

Why solve this with a neural network?

Given that database engines can already recall node properties easily, why go to the effort of engineering a more complex neural network based solution?

This article comes from our work to build MacGraph, a sophisticated question-answering system that reasons over knowledge graphs. MacGraph is entirely neural network based. This approach can do things a traditional database engine cannot:

  • Take natural language (e.g. English) questions as input
  • Store, process and output probabilistic data
  • Handle linguistic and semantic similarities in its processing
  • Learn algorithms from question-answer examples²
  • Extract and combine multiple pieces of information from the knowledge graph in ways that are difficult to write as queries

By extracting node properties from the graph using a neural network, the entire reasoning system can be a single differentiable function, e.g. a TensorFlow graph, and can be trained using gradient descent and supervised learning.

Finally, building this system as a neural network is an interesting challenge!

The dataset we’ll solve: CLEVR-graph

When a machine learning model fails to achieve high accuracy, it can be hard to know whether the model has a deficiency or whether the data contains inherent noise and ambiguities.

To remove this uncertainty we’ve employed a synthetic dataset. This is data that we generated, based on our own set of rules. Thanks to the explicit structure of the data, we can be confident a good model can score 100% accuracy. This really helps when comparing different architectures.

CLEVR-graph contains a set of questions and answers about procedurally generated transport network graphs. Here’s what one of its transport networks looks like (it’s vaguely modeled on the London Tube) and some example questions and answers:

(Figure: an example generated transport network with sample question-answer pairs. See the CLEVR-graph repository to learn more about the dataset.)

Every question in CLEVR-graph comes with an answer and a unique, procedurally generated graph.

The graph-question-answer triples are compiled into TFRecords. The training script will automatically download a compiled version of the dataset if not already present.
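If you want to poke at the data directly, loading the compiled records is straightforward with tf.data. A minimal sketch, where the filename is a placeholder and the real parsing schema lives in the repo's input pipeline:

import tensorflow as tf

# Each record holds one graph-question-answer triple (filename illustrative)
dataset = tf.data.TFRecordDataset("train.tfrecords")

# Each record is then decoded with tf.parse_single_example, using the
# feature schema defined in the repo's input pipeline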

CLEVR-graph can generate many different types of questions. For this article we’ll generate just those that pertain to station properties. There are twelve of these (e.g. “How clean is X?” or “What music plays at Y?”), and each is combined with a randomly chosen station from its graph to give a question-answer pair. As there are only twelve different question templates, the training data lacks the variety you’d find in a more natural (human) source, which makes the dataset easier to solve. We leave language diversity as a future extension (and would love to see readers’ solutions!).

The technologies we’ll use

In case you’re not familiar with the technologies we’re employing, I’ll give a brief introduction to them here.

Knowledge graphs

A knowledge graph, simply put, is a store of information with entities and relationships between them. It’s how we naturally draw out families:

Using a Neo4j graph database for a family tree

Graph structure, where each entity can have any number of relationships to other entities, is highly flexible and can represent many things in our world (e.g. transit networks, social networks, servers, Wikipedia).
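In code, the core idea is nothing more than a list of entities and a list of typed relationships between them. Here's a minimal Python sketch (the names are invented for illustration):

nodes = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
]
edges = [
    {"from": 1, "to": 3, "type": "PARENT_OF"},  # Alice is a parent of Carol
    {"from": 2, "to": 3, "type": "PARENT_OF"},  # Bob is a parent of Carol
]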

The term “Knowledge Graph” was made popular when Google built their own, which now stores over 70 billion facts. They use it to help answer Google searches and “Ok Google” queries.

One of the great benefits of knowledge graphs is that they can store many types of information. It’s easy to extend a knowledge graph with new knowledge by adding more entities and adding connections between existing and new entities.

At Octavian, we believe that knowledge graphs are an important part of making systems with human-level reasoning.

TensorFlow

For the machine learning part of this system, we’ll use TensorFlow. TensorFlow is primarily a model building and training framework, letting us express our model and train it on our data. TensorFlow has quickly become one of the most popular and actively developed machine learning libraries.

TensorFlow can save a lot of development time. Once a model has been built in TensorFlow, the same code can be used for experimentation, development and deployment to production. There are many platforms like FloydHub, Google CloudML and AWS SageMaker that help with training and deploying TensorFlow models.

An overview of the system

The system is built in TensorFlow and achieves 100% accuracy on the synthetic dataset challenge. You can see the code in our GitHub.

Here’s an overview of the system:

The system begins by transforming the input question into integer tokens, which are then embedded as vectors.

Next, the control cell³ performs attention over the token vectors. This produces the control signal that is used by the subsequent cells to guide their actions.

Then the read cell uses the control signal to extract a node from the graph node list. It then extracts one property of that node. This cell will be explained in more detail later.

Finally, the output cell transforms the output of the read cell into an answer token (e.g. an integer that maps to an English word in our dictionary).

Now that we’ve seen an overview of the model, let’s describe how each of the pieces works. I’ll explain the different parts of the model and link to further explanations, since it’s hard to fully explain and do justice to the background in just one article.

Parsing the input question

The first step of the model is to transform the English question into information the model can use:

The code for this process is in build.py in the accompanying GitHub repository.

Tokenizing the text is necessary as our sequence encoder will need a list of values to operate upon. We use the same tokenization scheme for the input question, the data in the knowledge graph and the answer output. This makes it possible for the network to use the words in the input question to query the nodes in the knowledge graph.

To make it easy to unambiguously refer to a station in the question, stations are named with randomly picked integers between one and seventy. Those station integers in the question are treated as tokens and get assigned a token integer ID (usually different from the station number).

Once tokenized into a list of integers, the tokens are embedded. This maps each token integer to a specific vector for that integer. This mapping (‘the embedding’) is learnt during training. It enables the network to learn a useful representation of tokens (for example, tokens that frequently play the same linguistic role may end up with similar embeddings).
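Here's a minimal sketch of the tokenize-then-embed step in 2018-era TensorFlow (the vocabulary and variable names are illustrative, not the exact ones from build.py):

import tensorflow as tf

vocab = {"<unk>": 0, "what": 1, "music": 2, "plays": 3, "at": 4, "12": 5, "?": 6}

def tokenize(question):
    # Map each word to its integer token ID, falling back to <unk>
    return [vocab.get(w, vocab["<unk>"]) for w in question.lower().split()]

token_ids = tf.constant([tokenize("what music plays at 12 ?")])  # [batch, seq_len]

# The embedding matrix is a trained variable: one 64-wide vector per token
embedding = tf.get_variable("token_embedding", [len(vocab), 64])
embedded_tokens = tf.nn.embedding_lookup(embedding, token_ids)  # [batch, seq_len, 64]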

The control cell³

The next step in the model is to extract information from the question tokens to help the read cell read the right data from the knowledge graph. This produces the control signal.

We do this by performing attention over the embedded question tokens (more on that in a minute). This outputs a signal that is a combination of the embedded question token vectors.

For the dataset we’re solving, the control cell tends to extract the station name token vector from the question⁶. This is then used in the read cell to extract that station node from the graph.

The control cell performs a classic attention operation on the embedded question tokens. A variable is trained to produce the query, which is then combined with each question token vector to produce dot-product scores. These scores are passed through a softmax. Finally, the question token vectors are summed together, weighted by their scores. Later in this article we’ll refer to this as attention by value, since the query is combined with the values in the list to produce the scores.
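In code, attention by value looks roughly like this (a sketch following the description above, reusing embedded_tokens from the earlier sketch; the repo's implementation may differ in detail):

import tensorflow as tf

def attention_by_value(values, query):
    # values: [batch, seq_len, width], query: [batch, width]
    scores = tf.einsum("bsw,bw->bs", values, query)  # dot-product scores
    weights = tf.nn.softmax(scores, axis=-1)         # attention distribution
    return tf.einsum("bs,bsw->bw", weights, values)  # weighted sum of the values

# In the control cell, the query is itself a trained variable:
query = tf.get_variable("control_query", [1, 64])
batch_size = tf.shape(embedded_tokens)[0]
control_signal = attention_by_value(embedded_tokens, tf.tile(query, [batch_size, 1]))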

The read cell

The read cell is the heart of this network: it does the actual interaction with the knowledge graph. You can think of the read cell as our own miniature database engine.

The read cell does three things:

  1. Apply the word embedding to the knowledge graph data
  2. Extract a node⁵ from the knowledge graph
  3. Extract a single property⁵ from that node

First, let’s explain how the knowledge graph is stored and presented to our network.

Encoding the knowledge graph

For this model, we only need the nodes from the graph⁴.

These are represented as a two-dimensional table. Each row is a node. Each column is a different node property. Each value in the table is an integer that represents a token. Here’s a sketch of the structure, with the token strings shown alongside for readability:
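The column layout and token IDs below are invented for illustration; the real columns come from the dataset's station properties:

import numpy as np

# columns:  name  cleanliness  music  architecture   (layout is illustrative)
node_table = np.array([
    [ 5, 17, 23, 31],  # e.g. station "12": clean, jazz, modern
    [ 8, 18, 24, 32],  # e.g. station "47": dirty, pop, victorian
], dtype=np.int32)     # shape: [num_nodes, num_properties]

# Applying the shared token embedding to every cell then yields a tensor of
# shape [num_nodes, num_properties, embed_width]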

How the read cell works

The read cell performs attention twice in a hierarchical fashion. The first attention extracts a node, the second attention extracts a property of that node. This produces the read signal.

The node extraction uses attention by value. Typically, the network uses the control signal to pull out the station name from the question. This is then compared with the name column of the table in the query operation and the relevant node extracted.

The property extraction uses attention by index. This is a less well known operation, so we’ll explain it here. The query input (control signal) is transformed via a dense layer into a vector with one element per item in the value list (the list of node properties). This is then softmaxed, and those scores are used to produce a weighted sum of the value list.

The effect of this attention by index operation is that the network can specify which column it wishes to extract from the node, agnostic of the values in it. This is helpful, as the question itself only contains weak information about what value to expect, but strongly guides which property column the model needs to extract.
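Putting the two attention operations together, the read cell looks roughly like this sketch (the first step inlines the attention-by-value pattern from earlier; name_col, num_props and the tensor shapes are assumptions about the encoding rather than the repo's exact constants):

import tensorflow as tf

def attention_by_index(values, query, num_items):
    # A dense layer maps the query to one raw score per list position,
    # independent of the values stored at those positions
    scores = tf.layers.dense(query, num_items)       # [batch, num_items]
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.einsum("bn,bnw->bw", weights, values)  # weighted sum

def read_cell(embedded_nodes, control_signal, name_col, num_props):
    # embedded_nodes: [batch, num_nodes, num_props, width]
    # Step 1, attention by value: score each node by comparing the control
    # signal with its embedded name column, then sum the nodes by weight
    names = embedded_nodes[:, :, name_col, :]
    node_weights = tf.nn.softmax(
        tf.einsum("bnw,bw->bn", names, control_signal), axis=-1)
    node = tf.einsum("bn,bnpw->bpw", node_weights, embedded_nodes)

    # Step 2, attention by index: the control signal picks which property
    # column to read, agnostic of the values in it
    return attention_by_index(node, control_signal, num_props)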

The output cell

The final step is to transform the read signal into an answer token and output it. As is standard in classification networks, the output is a probability distribution. As there are 128 different tokens in our vocabulary, the output signal is a 128-wide vector that is fed into a cross-entropy loss function for training and argmax for prediction.
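As a rough sketch in the same style as before (read_signal is the output of the read cell above, and answer_tokens stands in for the expected answer token IDs):

import tensorflow as tf

# Project the read signal onto the 128-token vocabulary
logits = tf.layers.dense(read_signal, 128)  # [batch, 128]

# Cross-entropy against the expected answer token, for training
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=answer_tokens, logits=logits))

# At prediction time, take the most likely token
predicted_token = tf.argmax(logits, axis=-1)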

Model training and performance

The model has been trained with the Adam optimizer, with a learning rate of 0.001 and gradient clipping of 0.4. The token embedding is 64 wide and the batch size is 32.
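In 2018-era TensorFlow terms that setup looks roughly like this (a sketch; whether the repo clips each gradient individually or by global norm is an assumption here):

import tensorflow as tf

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=0.4)  # assuming global-norm clipping
train_op = optimizer.apply_gradients(zip(clipped, variables))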

To train the network yourself, clone the repository from our GitHub, install the dependencies with pipenv install, enter the virtual environment with pipenv shell, then simply run train.sh.

After 8k training steps, the network achieves > 99% accuracy. This is as expected since the dataset was constructed to be free of noise and ambiguities.

By running tensorboard --logdir output/ you can see the training progress:

You can observe the inner workings of the network by running it in predict mode with predict.sh. The script will pretty-print each prediction, along with what the attention operations focussed on:

Next steps

Thanks for following along on this little adventure into neural database engines! We’ve shown how to perform a basic query on a knowledge graph, and hopefully given you some ideas to take to other problems.

There are a lot of interesting ways to extend this network:

  • Read edges from the graph
  • Handle a wider range of question types and language variability
  • Handle ambiguous/relative node references
  • Extract multiple node properties then combine those

This work is part of MacGraph, our effort to build a neural reasoning engine that answers questions using knowledge graphs. It contains a lot of extensions to the mechanisms shown here, which we’ll soon write about.

Octavian’s research

Octavian’s mission is to develop systems with human-level reasoning capabilities. We believe that graph data and deep learning are key ingredients to making this possible. If you’re interested in learning more about our research or contributing, check out our website or get in touch.

Acknowledgements

Thanks to Andrew Jefferson for his proofing and feedback on this article. Thanks to Drew Hudson and Christopher Manning for the original inspiration for this implementation direction, and also for interesting email discussion.

Footnotes

  1. In this work we restrict the questions to unambiguously refer to a single node (e.g. by including the node’s name/id in the question). This is unrealistic for most real-world queries, and is done so this article can focus on explaining the attention mechanisms. However, because the attention approach is based on trained embeddings and a biLSTM, it is actually reasonably capable of resolving basic linguistic ambiguities. To truly deal with ambiguity, the model needs to traverse the graph (e.g. to determine “Andy’s keys” or “my 10am meeting”) to filter the node list down to the best match. This sort of iterative reasoning is precisely what our main branch of work focuses on, and we will present it in a future article.
  2. Broadly speaking, this is the field of Neural Turing Machines. These are notoriously difficult to train, as it is hard to ascertain an algorithm from observing an input-output pair. In a sense, the generality of neural Turing machines (they are Turing complete) makes their solution search space very large and hard to navigate. In MacGraph we’re attempting to overcome this problem by providing optimized primitives (e.g. convenient ways for the network to perform common operations like path searching) and restricting the range of algorithms we attempt to learn to a subset useful for “human reasoning” style activities. Furthermore, we employ curriculum learning (i.e. progressively solving larger versions of the same problem).
  3. This naming scheme is derived from Compositional Attention Networks for Machine Reasoning by Drew A. Hudson, Christopher D. Manning. It is a little over-complicated for the code presented here; it is a vestige of the much larger codebase this has been extracted from.
  4. MacGraph requires nodes, edges and their properties. Currently we encode this in three ways: A table of nodes, a table of edges and an adjacency matrix. Whilst the last two are redundant, having an adjacency matrix allows for message passing along the graph through simple vector algebra.
  5. Strictly speaking, attention produces a weighted sum of its value list, therefore does not return “one item” as classical list indexing does. However, we’ve found that this model typically converges to highly focussed attention scores where one item receives the majority of the weighting. Therefore, for simplicity, we informally refer to this as retrieving one of the list items.
  6. This behavior has varied with different iterations of this model architecture. Currently the focus is mostly on the station names, with a little attention paid to keywords that indicate the property to extract. In other variations of the model two control heads were used which sometimes both extracted the same tokens, and sometimes one would focus on station names whilst the other focussed on question-type indicating keywords.
