AmpliGraph — What is it?

This blog explains what AmpliGraph is and how to use it.

What is AmpliGraph?

AmpliGraph is a suite of neural machine learning models for relational learning, a branch of machine learning that deals with supervised learning on knowledge graphs.

Use AmpliGraph if you need to:

  • Discover new knowledge from an existing knowledge graph.
  • Complete large knowledge graphs with missing statements.
  • Generate stand-alone knowledge graph embeddings.
  • Develop and evaluate a new relational model.

AmpliGraph’s machine learning models generate knowledge graph embeddings, vector representations of concepts in a metric space.

It then combines embeddings with model-specific scoring functions to predict unseen and novel links.

The AmpliGraph API includes the following submodules:

  • datasets: helpers for loading and managing knowledge graph datasets
  • latent_features: the knowledge graph embedding models (TransE, ComplEx, DistMult, HolE, and others) together with their loss functions and regularizers
  • evaluation: metrics, evaluation protocols, and dataset-splitting utilities
  • discovery: high-level routines for knowledge discovery on top of trained models
  • utils: helpers for saving, restoring, and visualizing models

Installation

Prerequisites

  • Linux, macOS, Windows
  • Python ≥ 3.7

Provision a Virtual Environment

Create and activate a virtual environment (conda)

conda create --name ampligraph python=3.7
conda activate ampligraph

Install AmpliGraph

Install the latest stable release from pip:

pip install ampligraph

If instead you want the most recent development version, you can clone the repository and install from source as shown below (see also the How to Contribute guide for details):

git clone https://github.com/Accenture/AmpliGraph.git
cd AmpliGraph
git checkout develop
pip install -e .

Sanity Check

>>> import ampligraph
>>> ampligraph.__version__
'1.3.2'

Now we are going to use Game of Thrones data to build a knowledge graph and explore it with AmpliGraph.

Please note: this isn’t the greatest dataset for demonstrating the power of knowledge graph embeddings, but it is small, intuitive, and should be familiar to most users.

Let’s get Started with AmpliGraph

1. Dataset exploration

First things first! Let's import the required libraries and retrieve some data.
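
The original notebook cell looks roughly like the sketch below; the CSV URL is the one used by the official AmpliGraph tutorial and may move over time:

import numpy as np
import requests

from ampligraph.datasets import load_from_csv

# Download the Game of Thrones knowledge graph: a CSV file of
# subject-predicate-object triples, one triple per line.
url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/GoT.csv'
open('GoT.csv', 'wb').write(requests.get(url).content)

# Load the triples into a NumPy array of shape (n, 3).
X = load_from_csv('.', 'GoT.csv', sep=',')
X[:5, ]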

The output is:

array([['Smithyton', 'SEAT_OF', 'House Shermer of Smithyton'],
       ['House Mormont of Bear Island', 'LED_BY', 'Maege Mormont'],
       ['Margaery Tyrell', 'SPOUSE', 'Joffrey Baratheon'],
       ['Maron Nymeros Martell', 'ALLIED_WITH', 'House Nymeros Martell of Sunspear'],
       ['House Gargalen of Salt Shore', 'IN_REGION', 'Dorne']], dtype=object)

2. Defining train and test datasets

As is typical in machine learning, we need to split our dataset into training and test (and sometimes validation) sets. What differs from the standard approach of randomly sampling N points for the test set is that our data points are two entities linked by some relationship, and we need to take care that every entity appearing in the test set is also represented in the training set by at least one triple. To accomplish this, AmpliGraph provides the train_test_split_no_unseen function.

We’ll stick to common practice and divide our training and test set in an 80/20 split.
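
A sketch of the split, using a fixed seed for reproducibility:

from ampligraph.evaluation import train_test_split_no_unseen

# Hold out 20% of the triples, while guaranteeing that every entity
# in the test set also appears somewhere in the training set.
num_test = int(len(X) * 0.2)
X_train, X_test = train_test_split_no_unseen(X, test_size=num_test, seed=0)

print('Train set size:', X_train.shape)
print('Test set size:', X_test.shape)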

3. Training a model

AmpliGraph implements several graph embedding models (TransE, ComplEx, DistMult, HolE), but to begin with we’re just going to use the ComplEx model (mostly with default values).

Let’s go through the parameters to understand what’s going on (a sketch of the model definition follows this list):

  • k : the dimensionality of the embedding space
  • eta (η): the number of negative (false) triples that must be generated at training runtime for each positive (true) triple
  • batches_count : the number of batches into which the training set is split during the training loop. If you are running into low-memory issues, setting this to a higher number may help.
  • epochs : the number of epochs to train the model for.
  • optimizer : the Adam optimizer, with a learning rate of 1e-3 set via the optimizer_params kwarg.
  • loss : pairwise loss, with a margin of 0.5 set via the loss_params kwarg.
  • regularizer : Lp regularization with p=2, i.e. l2 regularization. λ = 1e-5, set via the regularizer_params kwarg.
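
Putting those together, the model definition might look like this (k=150 and eta=5 follow the official tutorial; treat them as a reasonable starting point rather than tuned values):

from ampligraph.latent_features import ComplEx

model = ComplEx(batches_count=100,
                seed=0,
                epochs=200,
                k=150,
                eta=5,
                optimizer='adam',
                optimizer_params={'lr': 1e-3},
                loss='pairwise',
                loss_params={'margin': 0.5},
                regularizer='LP',
                regularizer_params={'p': 2, 'lambda': 1e-5},
                verbose=True)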

4. Fitting the model

Once you run the next cell the model will train.

On a modern laptop, this should take ~3 minutes (although your mileage may vary, especially if you’ve changed any of the hyper-parameters above).
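
A sketch of the training cell, assuming the model defined above:

# Fit the model on the training triples; with 200 epochs this is
# the step that takes a few minutes.
model.fit(X_train, early_stopping=False)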

Average Loss:   0.021658: 100%|██████████| 200/200 [01:28<00:00,  2.25epoch/s]

5. Evaluating a model

Now it’s time to evaluate our model on the test set to see how well it’s performing.

For this we’ll use the evaluate_performance function. Let’s look at its arguments (a sketch of the call follows this list):

  • X - the data to evaluate. We're going to use our test set to evaluate.
  • model - the model we previously trained.
  • filter_triples - will filter out the false negatives generated by the corruption strategy.
  • use_default_protocol - specifies whether to use the default corruption protocol. If True, then subj and obj are corrupted separately during evaluation.
  • verbose - will give some nice log statements. Let's leave it on for now.
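
A sketch of the evaluation call; positives_filter is simply the full dataset, so corruptions that happen to be true triples are not counted against the model:

from ampligraph.evaluation import evaluate_performance

# Filter on the whole dataset: any generated corruption that is
# actually a true triple gets excluded from the ranking.
positives_filter = X

ranks = evaluate_performance(X_test,
                             model=model,
                             filter_triples=positives_filter,
                             use_default_protocol=True,  # corrupt subj and obj separately
                             verbose=True)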
WARNING - DeprecationWarning: use_default_protocol will be removed in future. Please use corrupt_side argument instead.
100%|██████████| 635/635 [00:02<00:00, 288.28it/s]

The ranks returned by the evaluate_performance function indicate the rank at which the test set triple was found when performing link prediction using the model.

For example, given the triple:

<House Stark of Winterfell, IN_REGION, The North>

The model returns a rank of 7. This tells us that while it’s not the most likely true statement (which would be given rank 1), it’s still pretty likely.

6. Metrics

Let’s compute some evaluation metrics and print them out.

We’re going to use the mrr_score (mean reciprocal rank) and hits_at_n_score functions.

  • mrr_score: computes the mean of the reciprocals of the elements of a vector of rankings (our ranks).
  • hits_at_n_score: computes how many elements of a vector of rankings (our ranks) make it into the top n positions.
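
A sketch of the metrics cell:

from ampligraph.evaluation import mrr_score, hits_at_n_score

mrr = mrr_score(ranks)
print("MRR: %.2f" % (mrr))

# Fraction of test triples ranked in the top n.
for n in [10, 3, 1]:
    print("Hits@%d: %.2f" % (n, hits_at_n_score(ranks, n=n)))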

And the output is,

MRR: 0.30
Hits@10: 0.41
Hits@3: 0.33
Hits@1: 0.24

7. Predicting New Links

Link prediction allows us to infer missing links in a graph. This has many real-world use cases, such as predicting connections between people in a social network, interactions between proteins in a biological network, and music recommendation based on prior user taste.

In our case, we’re going to see which of the following candidate statements (that we made up) are more likely to be true:
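
The original notebook lists 22 candidate triples; the sketch below uses a shortened, illustrative list (made-up statements, as noted above) and scores them with the trained model:

# A few made-up candidate statements (the full tutorial uses 22).
X_unseen = np.array([
    ['Jorah Mormont', 'SPOUSE', 'Daenerys Targaryen'],
    ['Tyrion Lannister', 'SPOUSE', 'Missandei'],
    ['Brienne of Tarth', 'SPOUSE', 'Jaime Lannister'],
])

# The filter must also include the unseen triples themselves.
unseen_filter = np.array(list({tuple(i) for i in np.vstack((positives_filter, X_unseen))}))

ranks_unseen = evaluate_performance(X_unseen,
                                    model=model,
                                    filter_triples=unseen_filter,
                                    corrupt_side='s+o',
                                    use_default_protocol=False,
                                    verbose=True)

# Raw model scores for each candidate triple.
scores = model.predict(X_unseen)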

100%|██████████| 22/22 [00:00<00:00, 159.31it/s]

We transform the scores (real numbers) into probabilities (bound between 0 and 1) using the expit transform.

Note that the probabilities are not calibrated in any sense.
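
A sketch of the transform and the summary table:

import pandas as pd
from scipy.special import expit

# Squash the raw scores into (0, 1). These are NOT calibrated
# probabilities, just a monotonic transform for readability.
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in X_unseen],
                      ranks_unseen,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values('prob', ascending=False)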

The resulting DataFrame lists each candidate statement together with its rank, raw score, and probability.

8. Visualizing Embeddings with Tensorboard projector
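
The files below are produced by AmpliGraph’s create_tensorboard_visualizations utility; a sketch of that cell, assuming the trained model from above:

from ampligraph.utils import create_tensorboard_visualizations

# Export the trained embeddings (plus metadata) in the format the
# Tensorboard Embedding Projector expects.
create_tensorboard_visualizations(model, 'GoT_embeddings')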

If all went well, we should now have several files in the AmpliGraph/tutorials/GoT_embeddings directory:

GoT_embeddings/
├── checkpoint
├── embeddings_projector.tsv
├── graph_embedding.ckpt.data-00000-of-00001
├── graph_embedding.ckpt.index
├── graph_embedding.ckpt.meta
├── metadata.tsv
└── projector_config.pbtxt

To visualize the embeddings in Tensorboard, run the following from your command line inside AmpliGraph/tutorials:

tensorboard --logdir=./GoT_embeddings

… and once your browser opens up, you should be able to see and explore your embeddings (PCA-reduced, two components).

Conclusion

In summary, knowledge graphs sit at the crossroads of databases and artificial intelligence, providing smart insights (or knowledge) from very different types of data. Decision-makers can store all business knowledge as a set of connected vectors and use artificial neural networks to reason over this information.

The source code of the AmpliGraph library is available at https://github.com/Accenture/AmpliGraph.
