Review prediction with Neo4j and TensorFlow

David Mack


We show how to create an embedding to predict product reviews, using the TensorFlow machine learning framework and the Neo4j graph database. It achieves 97% validation accuracy.


A common problem in business is product recommendation. Given what a person has liked so far, what should we suggest they purchase next? Just as a waiter asking if you’d like another drink drives higher revenues, so do successful recommendations.

There are many approaches to recommendation. We’re going to focus on review prediction: given a product a person has not reviewed, what review would they give it? We can then recommend to that person the products we predict they will favorably review.

The code for the completed system is available in our GitHub.

The technologies we’ll use


We’re going to use a graph database as the data-source for this system. Graph databases are a powerful way to store and analyze data. Often the relationships between things, for example between people, are as important as the properties of those things themselves. In a graph database it’s easy to store and analyze those relationships. In this review prediction system we’ll be analyzing the network of reviews between different people and different products.

We’ll use Neo4j as our graph database. Neo4j is a popular, fast and free-to-use graph database (we provide a hosted database for this article’s dataset to save you having to set one up for yourself).


For the machine learning part of this system, we’ll use TensorFlow. TensorFlow is primarily a model building and training framework, letting us express our model and train it on our data. TensorFlow has quickly become one of the most popular and actively developed machine learning libraries.

TensorFlow can save a lot of development time. Once a model has been built in TensorFlow, the same code can be used for experimentation, development and deployment to production. Platforms like Google’s CloudML provide model hosting as a service, serving your model’s predictions as a REST API.

The problem

We’re going to be predicting product reviews. In our world there are people, who write reviews of products. Here’s what this looks like in a graph:

In a graph database we can query information based on patterns. Neo4j, the database we’ll use here, uses a query language called Cypher. The above graph was generated by a simple query:


This looks for a node, of label PERSON, with a relationship of label WROTE, to a node of label REVIEW, with a relationship of label OF to a node of label PRODUCT. The qualifier “LIMIT 1” asks the database to just return one instance that matches this pattern.
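Written out, a query matching this description could look like the string below, held in a Python constant so it can be handed to a Neo4j driver. This is a sketch: the arrow directions and the path variable `p` are assumptions, not taken from the original query.

```python
# Hypothetical query matching the pattern described above.
# The relationship directions and the variable name `p` are assumptions.
EXAMPLE_QUERY = """
MATCH p = (:PERSON)-[:WROTE]->(:REVIEW)-[:OF]->(:PRODUCT)
RETURN p
LIMIT 1
"""

print(EXAMPLE_QUERY.strip())
```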

Neo4j implements a property graph model, in which nodes and relationships can have properties. This is a really flexible model allowing us to conveniently put data where we want.

The dataset we’ll train on

Our dataset contains 250 people and 50 products. Each person has 40 reviews, giving a total of 10,000 reviews.

You can use our hosted database, or generate the data into your own Neo4j instance using our generation codebase:

./ --dataset article_1

The dataset is synthetic — we generated it ourselves from a probabilistic model.

Using a synthetic dataset is a useful technique during model development. If you’re applying an unproven method to unknown data and it fails to train, you cannot tell if the problem is the data or the model. By synthesizing the data one unknown is removed and you can focus on finding a successful model.

A synthetic dataset has limitations: it lacks the irregularities and errors typical of real world data. For this learning exercise, the synthetic dataset is very useful, but any real-world system will require more steps of cleaning the data and experimenting to find a model that fits it.

How we generated the dataset

Review nodes have a score property

Our synthetic data generation uses a simple probabilistic model. During generation each product and person has a randomly chosen category, and these categories are used to generate review scores. We save the people, products and reviews to the database and discard the category assignments (it would make the review prediction too easy).

In more detail:

We generate a set of 250 people and 50 products. Each person reviews 40 randomly chosen products.

Each person and product has a one-hot encoded vector of width 6. Think of this as choosing one category from six choices. For example, each product can be one of six colors (its style). Each person prefers one of those six colors (their preference).

Each review from a person to a product is calculated as the dot product of their vectors, giving 1.0 if they share the same style and preference, or 0.0 otherwise.

Finally, we assign the test property to a randomly selected 10% of the reviews. This data is used for evaluating the model, and is not used for training it.
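To make those steps concrete, here is a minimal pure-Python sketch of the generation process. It is not our generation codebase: the function names, dictionary keys and seed handling are all illustrative assumptions, and nothing is written to Neo4j.

```python
import random

N_PEOPLE, N_PRODUCTS, REVIEWS_PER_PERSON, N_CATEGORIES = 250, 50, 40, 6


def one_hot(index, width):
    """A vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(width)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_reviews(seed=0):
    rng = random.Random(seed)
    # One randomly chosen category per person (preference) and product (style)
    preferences = [one_hot(rng.randrange(N_CATEGORIES), N_CATEGORIES)
                   for _ in range(N_PEOPLE)]
    styles = [one_hot(rng.randrange(N_CATEGORIES), N_CATEGORIES)
              for _ in range(N_PRODUCTS)]

    reviews = []
    for person_id, preference in enumerate(preferences):
        # Each person reviews 40 distinct, randomly chosen products
        for product_id in rng.sample(range(N_PRODUCTS), REVIEWS_PER_PERSON):
            reviews.append({
                "person_id": person_id,
                "product_id": product_id,
                "score": dot(preference, styles[product_id]),  # 1.0 or 0.0
                "test": False,
            })

    # Hold out a random 10% of the reviews for evaluation
    for review in rng.sample(reviews, len(reviews) // 10):
        review["test"] = True
    return reviews


reviews = generate_reviews()
print(len(reviews), sum(r["test"] for r in reviews))  # 10000 1000
```

Note that the category vectors never leave the function, mirroring how we discard them before saving to the database.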

Since each person reviews 40 randomly chosen products, it’s highly likely (although not certain) they will review one product of each of the six styles — therefore our review prediction challenge is well-constrained and we should be able to get close to 100%.
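We can put a rough number on "highly likely". The sketch below assumes that the style of each reviewed product is an independent, uniform draw from the six styles (an approximation, since in the real dataset the 40 products are picked without replacement from 50 products with fixed styles). Under that assumption, inclusion-exclusion gives the chance that all six styles show up:

```python
from math import comb


def prob_all_styles_seen(n_reviews=40, n_styles=6):
    """P(every style appears at least once in n_reviews independent uniform draws)."""
    return sum(
        (-1) ** k * comb(n_styles, k) * ((n_styles - k) / n_styles) ** n_reviews
        for k in range(n_styles + 1)
    )


print(prob_all_styles_seen())
```

This comes out at about 99.6%, so only a handful of people in a dataset of this size would be expected to miss a style entirely.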

In academic literature our problem is known as “collaborative filtering”. By combining the reviews of many people (‘collaborative’) we can better recommend products for one person (‘filtering’).


We’re going to solve this review prediction problem by estimating a style vector for each product and a preference vector for each person. We’ll predict review scores by taking the dot product of those two vectors (since we know the data was generated using the dot product, it’s easy for us to guess this might be a successful solution).

The input to our model is the ID of the person and the ID of the product. The output of the model is the review score.

Review prediction is an interesting problem because we do not know the style of each product, nor the preference of each person, therefore we have to determine both simultaneously. A mistake in predicting a product’s style will then cause mistakes in predicting people’s preferences, so solving this is not trivial.

It should be noted that this is not “deep learning” (though it is machine learning). We’re using TensorFlow as a convenient framework to train a shallow model via gradient descent.

As an aside, adding deep layers to the model we defined above has been reported as successful in some academic papers.

Implementing our model in TensorFlow

(Note: I’ve simplified the code for presentation, removing classes and boilerplate. Check out the full working example with comments for the details.)

Embedding variables

The first step in our model is to transform person IDs and product IDs into estimated preference and style tensors. Thankfully, this is quite straightforward in TensorFlow.

We’ll store the preference and style estimations for all of the people and products as two variables, of shape [number_of_ids, width_of_tensor]:

product = tf.get_variable("product", [n_product, embedding_width])
person = tf.get_variable("person", [n_person, embedding_width])

We can use tf.nn.embedding_lookup(product, product_id) to transform an ID into a tensor of shape [width_of_tensor].
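Conceptually, an embedding lookup is nothing more than selecting rows from a table by index. The pure-Python stand-in below illustrates the idea; the function and the toy values are made up for illustration and are not TensorFlow code.

```python
def embedding_lookup(table, ids):
    """Return the rows of `table` selected by `ids`, one row per ID."""
    return [table[i] for i in ids]


# A toy table: 3 products, embedding width 2 (values are arbitrary)
product_table = [
    [0.1, 0.9],
    [0.5, 0.5],
    [0.8, 0.2],
]

print(embedding_lookup(product_table, [2, 0]))  # [[0.8, 0.2], [0.1, 0.9]]
```

During training, gradient descent adjusts the rows of the table, so each ID gradually acquires a useful vector.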

Format of the embedding tensors

Each embedding tensor will be floating point with 20 dimensions. Unlike in the data generation, there is no restriction for the value to be one-hot encoded.

Together, this design allows the model a lot of room to maneuver during training — this is helpful as gradient descent updates the variables with many small steps, and if it had to make a “big leap” to get to successful variables it may never get there. This design was determined through experimentation and grid search.

Model implementation

The model for our prediction is just eight lines long:

# Allocate storage for the estimations
product = tf.get_variable("product", [n_product, embedding_width])
person = tf.get_variable("person", [n_person, embedding_width])
# Retrieve the embedding tensors
product_emb = tf.nn.embedding_lookup(product, product_id)
person_emb = tf.nn.embedding_lookup(person, person_id)
# Dot product
m = tf.multiply(product_emb, person_emb)
m = tf.reduce_sum(m, axis=-1)
m = tf.expand_dims(m, -1) # So this fits as input for dense()
# A dense layer to fit the score to the range in the data
review_score = tf.layers.dense(m, (1), tf.nn.sigmoid)
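To see what this computes for a single person and product, here is the same forward pass in plain Python. It is a sketch only: the `weight` and `bias` arguments stand in for the two parameters the dense layer would learn, and the values used below are hypothetical.

```python
import math


def predict_review_score(product_emb, person_emb, weight=1.0, bias=0.0):
    # Dot product of the two embeddings
    m = sum(p * q for p, q in zip(product_emb, person_emb))
    # A one-unit dense layer followed by a sigmoid squashes the result into (0, 1)
    return 1.0 / (1.0 + math.exp(-(weight * m + bias)))


# Hypothetical learned values: matching vectors score near 1, mismatched near 0
same = predict_review_score([0.0, 1.0], [0.0, 1.0], weight=10.0, bias=-5.0)
different = predict_review_score([0.0, 1.0], [1.0, 0.0], weight=10.0, bias=-5.0)
print(same, different)
```

The dense layer matters because a raw dot product is unbounded, whereas the scores in the data lie between 0 and 1.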

The next step is to wrap up this model in the other pieces needed to train it. We’ll use the high-level Estimator API as it has pre-built routines for training, evaluating and serving the model, which we’d otherwise have to re-write.

The model function

The core of the Estimator framework is a model function.

This is a function we write and hand to TensorFlow, so that the framework can instantiate our model as often as it needs to (for instance, it might run multiple models across different GPUs/machines, or it might re-run the model with different learning rates to determine the best).

The model function is a Python function that takes the input feature tensors (and some other parameters) and returns an EstimatorSpec which contains a few things:

  • A measure of the model’s loss (i.e. how well it’s fitting the training data)
  • A training operation (the ‘code’ to be executed to train the model in each step)
  • Evaluation metrics (the measures of model success we’ll view in TensorBoard)

For measuring loss we’ll use the built-in mean squared error:

loss = tf.losses.mean_squared_error(label_review_score, pred_review_score)
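For intuition, mean squared error is just the average of the squared differences between predicted and actual scores. A plain-Python equivalent (illustrative only, with made-up numbers) looks like this:

```python
def mean_squared_error(labels, predictions):
    """Average of squared differences between labels and predictions."""
    return sum((l - p) ** 2 for l, p in zip(labels, predictions)) / len(labels)


# Errors of 0.1, 0.2 and 0.0 give (0.01 + 0.04 + 0.0) / 3
example_loss = mean_squared_error([1.0, 0.0, 0.0], [0.9, 0.2, 0.0])
print(example_loss)
```

Squaring penalizes confident mistakes far more than small ones, which nudges the model toward scores close to 0 or 1.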

And for the training operation we’ll use the built-in Adam optimizer to minimize the loss:

train_op = tf.train.AdamOptimizer(params["lr"]).minimize(loss)

And we’ll measure one evaluation metric, the accuracy:

eval_metric_ops = {
    # Round the sigmoid output to 0 or 1: tf.metrics.accuracy checks exact equality
    "accuracy": tf.metrics.accuracy(label_review_score, tf.round(pred_review_score))
}

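Accuracy here means the fraction of reviews whose predicted score, once snapped to 0 or 1, equals the true score. The plain-Python sketch below shows the idea; the 0.5 threshold is an assumption, chosen because the scores in this dataset are exactly 0.0 or 1.0.

```python
def accuracy(labels, predictions, threshold=0.5):
    """Fraction of predictions that match the label after thresholding."""
    hits = sum(
        1 for l, p in zip(labels, predictions)
        if (1.0 if p >= threshold else 0.0) == l
    )
    return hits / len(labels)


# Three of the four predictions land on the right side of the threshold
print(accuracy([1.0, 0.0, 0.0, 1.0], [0.93, 0.08, 0.61, 0.77]))  # 0.75
```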
Finally returning an EstimatorSpec:

return tf.estimator.EstimatorSpec(
    mode, loss=loss, train_op=train_op, eval_metric_ops=eval_metric_ops)

You can see the code all together in

Getting the data from Neo4j

We’ll use a Cypher query to get the data from our graph database and format it for training:

(review:REVIEW {dataset_name:"article_1", test:{test}})
RETURN as person_id, as product_id,
review.score as review_score

This returns one row for each review in our database. We then format each row for TensorFlow as a tuple of (input_dict, expected_output_score):

raw_data =, **query_params).data()

def format_row(i):
    # Returns a tuple of (input_dict, expected_output_score)
    return (
        {
            "person": {
                "id": self._get_index(i, "person"),
                "style": i["person_style"],
            },
            "product": {
                "id": self._get_index(i, "product"),
                "style": i["product_style"],
            },
        },
        i["review_score"],
    )

data = [format_row(i) for i in raw_data]

Next, we construct a TensorFlow Dataset. This is a high-level TensorFlow API that allows the framework to do a lot of the hard work of transforming and distributing our data for training. We’ll use the API to create a dataset from our generator, shuffle the data and batch it:

t =
lambda: (i for i in data),
t = t.shuffle(len(self))
t = t.batch(batch_size)

Shuffling helps the network learn as it will encounter different combinations of people and products in each batch.
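Stripped of the framework, shuffling and batching amount to the following. This is a pure-Python sketch rather than the Dataset API; the function name and `seed` parameter are illustrative.

```python
import random


def shuffled_batches(rows, batch_size, seed=None):
    """Yield the rows in random order, grouped into lists of up to batch_size."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


batches = list(shuffled_batches(range(10), batch_size=4, seed=0))
print([len(b) for b in batches])  # [4, 4, 2]
```

Shuffling over the whole dataset, as with t.shuffle(len(self)) above, means every review has an equal chance of landing in any batch.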

Similar to the model function we created earlier, we will now create an input function. TensorFlow will construct a dataset many times during training (for example, when it reaches the end of the data and wishes to restart) and the input function gives it the ability to do so.

We create an input_fn for TensorFlow that requires no arguments:

input_fn = lambda: data.gen_dataset()
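The reason for passing a function rather than the data itself is easiest to see with plain generators: once consumed they are empty, whereas a zero-argument function can build a fresh one on demand. The sketch below uses made-up rows and a hypothetical `make_input_fn` helper.

```python
def make_input_fn(rows):
    """Build a zero-argument function that returns a fresh iterator each call."""
    def input_fn():
        return (row for row in rows)
    return input_fn


rows = [("alice", "kettle", 1.0), ("bob", "kettle", 0.0)]

# A bare generator can only be read once
one_shot = (row for row in rows)
consumed = list(one_shot)
leftover = list(one_shot)  # empty: the generator is exhausted

# Calling the input function again restarts from the beginning
input_fn = make_input_fn(rows)
first_pass = list(input_fn())
second_pass = list(input_fn())
print(len(leftover), len(first_pass), len(second_pass))  # 0 2 2
```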

Putting it all together

Now that we have our model_fn and our input_fn we’re ready to train! We’re going to use the train_and_evaluate method of the Estimator API to coordinate the training and evaluation for us.

We construct an Estimator, specify the training data and number of steps in a TrainSpec and specify the evaluation data in an EvalSpec:

estimator = tf.estimator.Estimator(model_fn, model_dir, params=vars(args))
train_spec = tf.estimator.TrainSpec(data_train.input_fn)
eval_spec = tf.estimator.EvalSpec(data_eval.input_fn, steps=None)

We specify steps=None so that the whole evaluation set will be used (instead of just the first 100 steps, which is the default).

Now that we’re ready to go, we can run the whole training and evaluation:

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Initial result: 92% evaluation accuracy

The Estimator framework will save our model and output summaries that TensorBoard can display for us.

Fire up TensorBoard and watch the progress:

tensorboard --logdir ./output

After 10,000 training steps the model achieves 92% accuracy:

This result is not too bad for such a simple implementation, but we can do better.

Improving training with random walks

Luckily, there is a short extension to our code that can help our model train to 97% accuracy.

We’re going to perform random walks across the graph. A random walk means to start at one graph node, randomly choose between the nodes it’s connected to, then do the same from that node, keeping your path in a list. It’s somewhat similar to how a drunk person traverses a city.
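In code, a random walk over a graph stored as an adjacency dictionary can be sketched like this. The toy graph and names are invented for illustration.

```python
import random


def random_walk(graph, start, length, rng=random):
    """Walk from `start`, hopping to a random neighbour until `length` nodes are visited."""
    path = [start]
    while len(path) < length:
        neighbours = graph[path[-1]]
        if not neighbours:
            break  # dead end: stop early
        path.append(rng.choice(neighbours))
    return path


# A tiny bipartite graph of people and the products they reviewed
graph = {
    "alice": ["kettle", "lamp"],
    "bob": ["kettle"],
    "kettle": ["alice", "bob"],
    "lamp": ["alice"],
}

walk = random_walk(graph, "alice", length=5, rng=random.Random(0))
print(walk)
```

Because people only connect to products and vice versa, every walk on a graph like this alternates between the two.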


The typical input to machine learning is fixed-size tabular data. Graphs can have any number of connections and nodes, therefore they do not readily fit into a fixed-size structure. This makes graphs hard to feed into machine learning.

Random walks are a very powerful way of capturing the connectivity of a graph in a simple data structure. Each walk outputs a list of fixed length. Each walk is a sample of the graph, and with a sufficient number of random walks the entire connectivity of the graph is represented. They’ve been very successful in a number of areas including modeling language, social networks and protein structures.

Random walks benefit our training as they propagate style and preference embeddings across the graph. For example, imagine developing language separately in two islands over millennia. When the islanders meet for the first time, they are unlikely to understand each other at all and may forever struggle with each other’s languages (e.g. English and Japanese).


However, if instead the language was developed on one connected land-mass the words and grammar would travel across the land whilst they developed, providing a basic compatibility between the members of different countries (e.g. Spanish and Italian).

In a similar way, random walks help build compatibility between the style and preference vectors of people and products across our graph. When we estimate someone’s review for a product they’ve never reviewed (and perhaps none of the people near them in the graph have reviewed either) there’s more chance their preference vector speaks the same “language” as the target product’s style vector.


There are two steps to the implementation:

  1. Index the row data by the product and person IDs
  2. Sample batch_size length walks from the indexed data

Then we feed this data into our Dataset as before. Whilst the code is reasonably straightforward, it is a little long to display here. You can read it online, or clone the whole repository.
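To give a flavor of the two steps, here is a much-shortened sketch. It is not the repository code: the row keys, the function names and the choice to alternate between "same person" and "same product" hops are assumptions made for illustration.

```python
import random
from collections import defaultdict


def index_rows(rows):
    """Step 1: index the review rows by person ID and by product ID."""
    by_person, by_product = defaultdict(list), defaultdict(list)
    for row in rows:
        by_person[row["person_id"]].append(row)
        by_product[row["product_id"]].append(row)
    return by_person, by_product


def sample_walk(rows, by_person, by_product, length, rng):
    """Step 2: sample `length` reviews, hopping via a shared person, then a shared product."""
    row = rng.choice(rows)
    walk = [row]
    follow_person = True
    while len(walk) < length:
        index, key = (by_person, "person_id") if follow_person else (by_product, "product_id")
        row = rng.choice(index[row[key]])  # may revisit a review already in the walk
        walk.append(row)
        follow_person = not follow_person
    return walk


rows = [
    {"person_id": 0, "product_id": 0, "review_score": 1.0},
    {"person_id": 0, "product_id": 1, "review_score": 0.0},
    {"person_id": 1, "product_id": 1, "review_score": 1.0},
    {"person_id": 1, "product_id": 2, "review_score": 0.0},
]
by_person, by_product = index_rows(rows)
batch = sample_walk(rows, by_person, by_product, length=4, rng=random.Random(0))
print([(r["person_id"], r["product_id"]) for r in batch])
```

Each walk then serves as one training batch, so neighbouring reviews in the graph are updated together.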

Result: 97% evaluation accuracy

Training the model now achieves 97% evaluation accuracy.

Note that the model starts from randomly initialized variables, and receives randomly ordered training data, therefore it can achieve different results on each training run. The model does not always converge and does not always achieve its highest performance. I’ve shown below 20 separate trainings of the model, a few of which occasionally achieve as high as 98% accuracy:

Don’t leave chance up to chance! Multiple runs of the same model training with different random starting conditions and training data ordering

Next steps

Thanks for reading this far! There are many interesting problems to solve as a follow-on from this one:

  • Introduce noise into the dataset
  • Generate review scores from a greater number of style and preference categories
  • Use a more complex model for review score generation
  • Generate a larger dataset and scale the model up to cope with it
  • Reduce the number of reviews per person (i.e. introduce greater sparsity)

All of the above can be synthesized easily using our generate-data codebase. Once you’ve generated the data it’s quite fun and addictive to try to find a successful predictive model.

Limitations of our approach

Whilst the approach in this article has achieved a high accuracy with few lines of code, it does have limitations. In particular:

  • It doesn’t know how to predict reviews for new people or products (the “cold start” problem)
  • GPUs limit how large a variable can be held in their memory, and therefore how many people/products can be trained for
  • We’ve used a very simple dot-product model. If the model were more complex it could be difficult to simultaneously train the embedding and the deep model

There are many popular approaches to recommendation systems; Wikipedia is a good starting point for learning about others.

These writings are part of a year-long exploration of AI architecture topics. Applaud this article, follow this publication or follow my twitter to get updates when the next articles come out.

Feel free to let me know topics you’d like to learn more about.


