Graphs and neural networks: Reading node properties

David Mack
Sep 29, 2018 · 12 min read

We often have questions like “What time is the meeting with Susan?” or “When does Walmart open?”. These questions ask for a property of an object in our world. In this article, I’ll show you how to answer these sorts of questions with a knowledge graph and a neural network.

It’s quite easy to do this using existing databases. For simplicity, we’ll assume that the question unambiguously¹ refers to a node in our graph. In a Neo4j graph database, one could simply query:

MATCH (a:SHOP {name:"Walmart"}) RETURN a.opening_time

In this article we will take a pure neural-network approach. This enables us to do some things a traditional database cannot, but it requires us to write more code and solve some interesting challenges.

All the code for this article is available in our GitHub repository.

Why solve this with a neural network?

Given that database engines can already recall node properties easily, why go to the effort of engineering a more complex, neural-network-based solution?

This article comes from our work to build MacGraph, a sophisticated question-answering system that reasons over knowledge graphs. MacGraph is entirely neural network based. This approach can do things a traditional database engine cannot:

  • Take natural language (e.g. English) questions as input

By extracting node properties from the graph using a neural network, the entire reasoning system can be a single differentiable function, e.g. a TensorFlow graph, and can be trained using gradient descent and supervised learning.

Finally, building this system as a neural network is an interesting challenge!

The dataset we’ll solve: CLEVR-graph

When a machine learning solution fails to achieve high accuracy, it can be hard to know whether the model has a deficiency or the data has inherent noise and ambiguities.

To remove this uncertainty we’ve employed a synthetic dataset. This is data that we generated, based on our own set of rules. Thanks to the explicit structure of the data, we can be confident a good model can score 100% accuracy. This really helps when comparing different architectures.

CLEVR-graph contains a set of questions and answers about procedurally generated transport network graphs. Here’s what one of its transport networks looks like (it’s vaguely modeled on the London Tube) and some example questions and answers:

Learn more about the dataset

Every question in CLEVR-graph comes with an answer and a unique, procedurally generated graph.

The graph-question-answer triples are compiled into TFRecords. The training script will automatically download a compiled version of the dataset if not already present.

CLEVR-graph can generate many different types of questions. For this article we’ll generate just those that pertain to station properties. There are twelve of these (e.g. How clean is X? What music plays at Y?) and they are combined with a randomly chosen station from each graph to give a question-answer pair. As there are only twelve different question templates, the training data lacks the variety you’d get from a more natural (human) source. This makes the dataset easier to solve. We leave language diversity as a challenge for future extension (and would love to see readers’ solutions!).
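To make this concrete, here’s a minimal sketch of how template-based question generation could work. The template wording, property names and helper code are invented for illustration; the real generation code lives in the CLEVR-graph repository.

import random

# A toy question generator. TEMPLATES pairs a question pattern with the
# node property that answers it; both are invented for this sketch.
TEMPLATES = [
    ("How clean is {station}?", "cleanliness"),
    ("What music plays at {station}?", "music"),
    # ...the real dataset has twelve such property templates
]

def generate_qa(graph_nodes):
    # Pick a random station and a random template, return (question, answer)
    node = random.choice(graph_nodes)
    template, prop = random.choice(TEMPLATES)
    return template.format(station=node["name"]), node[prop]

nodes = [{"name": "17", "cleanliness": "clean", "music": "jazz"}]
print(generate_qa(nodes))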

The technologies we’ll use

In case you’re not familiar with the technologies we’re employing, I’ll give a brief introduction to them here.

Knowledge graphs

A knowledge graph, simply put, is a store of information with entities and relationships between them. It’s how we naturally draw out families:

Using a Neo4j graph database for a family tree

Graph structure, where each entity can have any number of relationships to other entities, is highly flexible and can represent many things in our world (e.g. transit networks, social networks, servers, Wikipedia).

The term “Knowledge Graph” was made popular when Google built their own, which now stores over 70 billion facts. They use it to help answer Google searches and “Ok Google” queries.

One of the great benefits of knowledge graphs is that they can store many types of information. It’s easy to extend a knowledge graph with new knowledge by adding more entities and adding connections between existing and new entities.

At Octavian, we believe that knowledge graphs are an important part of making systems with human-level reasoning.

TensorFlow

For the machine learning part of this system, we’ll use TensorFlow. TensorFlow is primarily a model building and training framework, letting us express our model and train it on our data. TensorFlow has quickly become one of the most popular and actively developed machine learning libraries.

TensorFlow can save a lot of development time. Once a model has been built in TensorFlow, the same code can be used for experimentation, development and deployment to production. There are many platforms like FloydHub, Google CloudML and AWS SageMaker that help with training and deploying TensorFlow models.

An overview of the system

The system is built in TensorFlow and achieves 100% accuracy on the synthetic dataset challenge. You can see the code in our GitHub.

Here’s an overview of the system:

The system begins by transforming the input question into integer tokens, which are then embedded as vectors.

Next, the control cell³ performs attention over the token vectors. This produces the control signal that is used by the subsequent cells to guide their actions.

Then the read cell uses the control signal to extract a node from the graph node list. It then extracts one property of that node. This cell will be explained in more detail later.

Finally, the output cell transforms the output of the read cell into an answer token (e.g. an integer that maps to an English word in our dictionary).

Now that we’ve shown an overview of the model, let’s describe how each of the pieces works. I’ll explain the different parts of the model and link to further explanations, since it’s hard to fully explain and do justice to the background in just one article.

Parsing the input question

The first step of the model is to transform the English question into information the model can use:

The code for this process is in build.py in the accompanying GitHub repository.

Tokenizing the text is necessary as our sequence encoder will need a list of values to operate upon. We use the same tokenization scheme for the input question, the data in the knowledge graph and the answer output. This makes it possible for the network to use the words in the input question to query the nodes in the knowledge graph.

To make it easy to unambiguously refer to a station in the question, stations are named with randomly picked integers between one and seventy. Those station integers in the question are treated as tokens and get assigned a token integer ID (usually different from the station number).

Once tokenized into a list of integers, the tokens are embedded. This maps each token integer to a specific vector for that integer. This mapping (‘the embedding’) is learnt during training. It enables the network to learn a useful representation of tokens (for example, tokens that frequently play the same linguistic role may end up with similar embeddings).
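As a rough sketch of these two steps, here’s what tokenization and embedding lookup might look like in NumPy. The vocabulary, sizes and helper function are made up for the example; in the real model the vocabulary is built in build.py and the embedding matrix is a trained TensorFlow variable.

import numpy as np

# Invented toy vocabulary; the real one is built from the dataset
vocab = {"<unk>": 0, "how": 1, "clean": 2, "is": 3, "station": 4, "17": 5}
embedding = np.random.randn(len(vocab), 64)  # one 64-wide vector per token, learnt in the real model

def tokenize(question):
    # Map each word to its token integer, unknown words to <unk>
    return [vocab.get(word, 0) for word in question.lower().rstrip("?").split()]

tokens = tokenize("How clean is station 17?")  # -> [1, 2, 3, 4, 5]
token_vectors = embedding[tokens]              # shape (5, 64)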

The control cell³

The next step in the model is to extract information from the question tokens to help the read cell read the right data from the knowledge graph. This produces the control signal.

We do this by performing attention over the embedded question tokens (more on that in a minute). This outputs a signal that is a combination of the embedded question token vectors.

For the dataset we’re solving, the control cell tends to extract the station name token vector from the question⁶. This is then used in the read cell to extract that station node from the graph.

The control cell performs a classic attention operation on the embedded question tokens. A variable is trained to produce the query, which is then combined with each question token vector to produce dot-product scores. These scores are passed through a softmax. Finally, the question token vectors are summed together, weighted by their scores. Later in this article we’ll refer to this as attention by value, since the query is combined with the values in the list to produce the scores.
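Here’s a small NumPy sketch of this attention-by-value operation. The query here is random, standing in for the trained variable, and the shapes and names are illustrative.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_by_value(values, query):
    # Score each value by its dot product with the query, softmax the
    # scores, then return the score-weighted sum of the values
    scores = softmax(values @ query)
    return scores @ values, scores

token_vectors = np.random.randn(7, 64)  # embedded question tokens (seq_len, width)
query = np.random.randn(64)             # stands in for the trained query variable
control_signal, focus = attention_by_value(token_vectors, query)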

The read cell

The read cell is the heart of this network: it does the actual interaction with the knowledge graph. You can think of the read cell as our own miniature database engine.

The read cell does three things:

  1. Apply the word embedding to the knowledge graph data
  2. Extract the relevant node from the graph using attention
  3. Extract the relevant property of that node using attention

First, let’s explain how the knowledge graph is stored and presented to our network.

Encoding the knowledge graph

For this model, we only need the nodes from the graph⁴.

These are represented as a two-dimensional table. Each row is a node. Each column is a different node property. Each value in the table is an integer that represents a token. Here’s an illustration of the structure, using the token string instead of token integer for readability:
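As a concrete toy example, here’s how such a table could be encoded. The property columns and token IDs are invented for the example; the real vocabulary is built in build.py.

import numpy as np

# Invented property columns and token IDs, for illustration only
columns = ["name", "cleanliness", "music", "architecture"]
vocab = {"17": 5, "clean": 6, "jazz": 7, "modern": 8,
         "23": 9, "dirty": 10, "pop": 11, "victorian": 12}

# One row per node, one column per property, one token integer per cell
node_table = np.array([
    [vocab["17"], vocab["clean"], vocab["jazz"], vocab["modern"]],
    [vocab["23"], vocab["dirty"], vocab["pop"], vocab["victorian"]],
])  # shape (num_nodes, num_properties)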

How the read cell works

The read cell performs attention twice in a hierarchical fashion. The first attention extracts a node, the second attention extracts a property of that node. This produces the read signal.

The node extraction uses attention by value. Typically, the network uses the control signal to pull out the station name from the question. This is then compared with the name column of the table in the query operation and the relevant node extracted.

The property extraction uses attention by index. This is a less well-known operation, so we’ll explain it here. The query input (the control signal) is transformed via a dense layer into a vector with one element per item in the value list (the list of node properties). This is then softmaxed, and those scores are used to produce a weighted sum of the value list.

The effect of this attention by index operation is that the network can specify which column it wishes to extract from the node, agnostic of the values in it. This is helpful, as the question itself only contains weak information about what value to expect, but strongly guides which property column the model needs to extract.
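Putting the two attention steps together, here’s a NumPy sketch of the hierarchical read. The shapes, the assumption that the name is column 0, and the stand-in weight matrix W are all illustrative; see the repository for the real implementation.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

nodes = np.random.randn(10, 12, 64)  # embedded node table (num_nodes, num_properties, width)
control = np.random.randn(64)        # control signal from the control cell

# 1. Attention by value over nodes: score each node by comparing the
# control signal with its name column (column 0 in this sketch)
node_scores = softmax(nodes[:, 0, :] @ control)
node = np.einsum("n,npw->pw", node_scores, nodes)  # weighted sum -> (12, 64)

# 2. Attention by index over properties: a dense layer (W stands in for
# its trained weights) maps the control signal to one score per column
W = np.random.randn(64, 12)
prop_scores = softmax(control @ W)
read_signal = prop_scores @ node     # weighted sum over columns -> (64,)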

The output cell

The final step is to transform the read signal into an answer token and output it. As is standard in classification networks, the output is a probability distribution. As there are 128 different tokens in our vocabulary, the output signal is a 128-wide vector that is fed into a cross-entropy loss function for training and argmax for prediction.
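As a sketch, the output cell boils down to something like this. The weight matrix and answer token are stand-ins for trained values.

import numpy as np

read_signal = np.random.randn(64)   # output of the read cell
W_out = np.random.randn(64, 128)    # stands in for the trained output layer

logits = read_signal @ W_out        # one score per token in the vocabulary
predicted_token = int(np.argmax(logits))  # argmax for prediction

# Training instead compares the logits to the true answer token with cross-entropy
answer_token = 42                   # hypothetical ID of the correct answer word
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[answer_token])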

Model training and performance

The model has been trained with the Adam optimizer, with a learning rate of 0.001 and gradient clipping of 0.4. The token embedding is 64 wide and the batch size is 32.
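In TensorFlow 1.x style (current when this was written), that training setup might look roughly like this. The toy logits and labels, and the assumption that 0.4 refers to global-norm clipping, are ours.

import tensorflow as tf

# Toy stand-ins for the real output-cell logits and answer labels
logits = tf.layers.dense(tf.random_normal([32, 64]), 128)  # batch of 32, 128-token vocabulary
labels = tf.zeros([32], dtype=tf.int32)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Adam with learning rate 0.001, assuming 0.4 is a global-norm clip
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped, _ = tf.clip_by_global_norm(grads, 0.4)
train_op = optimizer.apply_gradients(zip(clipped, variables))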

To train the network yourself, clone it from our GitHub, pipenv install the dependencies, enter the virtual environment with pipenv shell, then simply run train.sh.

After 8k training steps, the network achieves > 99% accuracy. This is as expected since the dataset was constructed to be free of noise and ambiguities.

By running tensorboard --logdir output/ you can see the training progress:

You can observe the inner workings of the network by running it in predict mode with predict.sh. The script will pretty-print each prediction and also show what the attention operations focused on:

Next steps

Thanks for following along on this little adventure into neural database engines! We’ve successfully shown how to perform a basic query on a knowledge graph, and hopefully given you some ideas to take to other problems.

There are a lot of interesting ways to extend this network:

  • Read edges from the graph

This work is part of MacGraph, our effort to build a neural reasoning engine that answers questions using knowledge graphs. It contains a lot of extensions to the mechanisms shown here, which we’ll soon write about.

Octavian’s research

Octavian’s mission is to develop systems with human-level reasoning capabilities. We believe that graph data and deep learning are key ingredients to making this possible. If you’re interested in learning more about our research or contributing, check out our website or get in touch.

Acknowledgements

Thanks to Andrew Jefferson for his proofing and feedback on this article. Thanks to Drew Hudson and Christopher Manning for the original inspiration for this implementation direction, and also for interesting email discussion.

Footnotes

  1. In this work we restrict the questions to unambiguously refer to a single node (e.g. by including the node’s name/id in the question). This is unrealistic for most real-world queries, and is done so this article can focus on explaining the attention mechanisms. However, because the attention approach is based on trained embeddings and a biLSTM, it is actually reasonably capable of resolving basic linguistic ambiguities. To truly deal with ambiguity, the system needs to traverse the graph (e.g. to determine “Andy’s keys” or “my 10am meeting”) to filter the node list down to the best match. This sort of iterative reasoning is precisely the focus of our main branch of work, which we will present in a future article.
