Loading data and querying knowledge from a Grakn Knowledge Graph using the Python client

Soroush Saffari
Published in Vaticle
12 min read · Oct 2, 2018

This tutorial may be out of date against the latest version of Grakn. For the most up-to-date version of this tutorial, please refer to the Grakn Documentation.

This tutorial illustrates, using the Grakn Python Client:

  • First: how to migrate a dataset in CSV, JSON or XML format into a Grakn knowledge graph.
  • Next: how to query our newly created knowledge graph to gain interesting insights over an example dataset.

The knowledge graph that we will work on in this post is called phone_calls. The schema for this knowledge graph was defined in a previous post.

For the experienced engineer

If all you need is a good example that shows how migrating data into Grakn works, you’ll find what you’re looking for here.

The Step by Step Guide

If you’d like to follow this tutorial step by step, it’s important that you understand what we are working on. This tutorial assumes that you have already defined the schema for phone_calls, the knowledge graph we’ll be working on in this post.

A Quick Look into the phone_calls Schema

Before we get started with migration, here is a quick reminder of what the schema for the phone_calls knowledge graph looks like.

First: Migrate Data into Grakn

Let’s go through an overview of how the migration takes place.

  1. We need a way to talk to our Grakn keyspace. To do this, we will use Grakn’s Python Client.
  2. We will go through each data file, extracting each data item and parsing it to a Python dictionary.
  3. We will pass each data item (in the form of a Python dictionary) to its corresponding template function, which in turn gives us the constructed Graql query for inserting that item into Grakn.
  4. We will execute each of those queries to load the data into our target keyspace, phone_calls.

Before moving on, make sure you have Python 3 and pip3 installed and the Grakn server running on your machine.

Getting Started

  1. Create a directory named phone_calls on your desktop.
  2. cd to the phone_calls directory via terminal.
  3. Run pip3 install grakn to install the Grakn Python Client.
  4. Open the phone_calls directory in your favourite text editor.
  5. Create a migrate.py file in the root directory. This is where we’re going to write all our code.

Including the Data Files

Pick one of the data formats below and download the files. After you download them, place the four files under the phone_calls/data directory. We will be using these to load their data into our phone_calls knowledge graph.

CSV: companies | people | contracts | calls

JSON: companies | people | contracts | calls

XML: companies | people | contracts | calls

Setting up the migration mechanism

All code that follows is to be written in phone_calls/migrate.py.

First things first: we import the grakn module. We will use it for connecting to our phone_calls keyspace.

Next, we declare the inputs. More on this later. For now, all we need to know is that inputs is a list of dictionaries, each one containing:

  • The path to the data file
  • The template function that receives a dictionary and produces the Graql insert query. We will define these template functions in a bit.

Let’s move on.

build_phone_call_graph(inputs)

This is the main and only function we need to call to start loading data into Grakn.

Here is what happens in this function:

  1. A Grakn client is created, connected to the server we have running locally.
  2. A session is created, connected to the keyspace phone_calls. Note that by using with, we indicate that the session will close after it’s been used.
  3. For each input dictionary in inputs, we call load_data_into_grakn(input, session). This takes care of loading the data specified in the input dictionary into our keyspace.

load_data_into_grakn(input, session)

In order to load data from each file into Grakn, we need to:

  1. Retrieve a list of dictionaries, each of which represents a data item. We do this by calling parse_data_to_dictionaries(input).
  2. For each dictionary in items: a) create a transaction tx, which closes once used, b) construct the graql_insert_query using the corresponding template function, c) run the query and d) commit the transaction.
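The two functions described above could be sketched roughly as follows. This is a minimal sketch, not the original migrate.py. It assumes the client API that the Grakn Python client exposed around the time of writing (grakn.Grakn(uri=...), client.session(keyspace=...) and session.transaction(grakn.TxType.WRITE)), a server listening on localhost:48555, and a data_path key that holds the file path without its extension. The company_template and the CSV-based parse_data_to_dictionaries are stand-ins so that the block is complete on its own; the post covers both in later sections.

```python
import csv


def company_template(company):
    # Stand-in template: turns one dictionary into a Graql insert query.
    return 'insert $company isa company has name "' + company["name"] + '";'


def parse_data_to_dictionaries(input):
    # Stand-in parser (CSV): one dictionary per row of the data file.
    with open(input["data_path"] + ".csv", newline="") as data:
        return [dict(row) for row in csv.DictReader(data, skipinitialspace=True)]


def load_data_into_grakn(input, session):
    # Imported here so the sketch can be loaded without the client installed.
    import grakn

    items = parse_data_to_dictionaries(input)
    for item in items:
        # One transaction per query; "with" closes it once the block ends.
        with session.transaction(grakn.TxType.WRITE) as tx:
            graql_insert_query = input["template"](item)
            print("Executing Graql Query: " + graql_insert_query)
            tx.query(graql_insert_query)
            tx.commit()
    print("Inserted " + str(len(items)) + " items from [" + input["data_path"] + "].")


def build_phone_call_graph(inputs):
    import grakn

    # Assumed default address of a locally running Grakn server.
    client = grakn.Grakn(uri="localhost:48555")
    with client.session(keyspace="phone_calls") as session:
        for input in inputs:
            load_data_into_grakn(input, session)


# Each input pairs a data file (path without extension) with its template.
inputs = [
    {"data_path": "data/companies", "template": company_template},
]

# With the Grakn server running and the data files in place:
# build_phone_call_graph(inputs)
```

The name input shadows a Python built-in; it is kept here only because the post uses it throughout.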

Note on creating and committing transactions: To avoid running out of memory, it’s recommended that each query is run and committed in its own transaction. However, for faster migration of large datasets, a commit can instead happen once for every n queries, where n is the maximum number of queries guaranteed to run on a single transaction.
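The "once for every n queries" variant from the note could look something like the sketch below. The function name load_queries_in_batches and the batch_size parameter are made up for illustration, and the transaction calls assume the same client API as the rest of this post (session.transaction(grakn.TxType.WRITE), tx.query, tx.commit). What a safe value for batch_size is depends on your data and server, so treat it as something to tune.

```python
def load_queries_in_batches(queries, session, batch_size):
    """Run a list of Graql insert queries, committing once per batch."""
    # Imported here so the sketch can be loaded without the client installed.
    import grakn

    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        # One transaction per batch instead of one per query.
        with session.transaction(grakn.TxType.WRITE) as tx:
            for graql_insert_query in batch:
                tx.query(graql_insert_query)
            tx.commit()
```

With five queries and a batch_size of 2, this opens three transactions holding 2, 2 and 1 queries.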

Before we move on to parsing the data into dictionaries, let’s start with the template functions.

The Template Functions

Templates are simple functions that accept a dictionary, representing a single data item. The values within this dictionary fill in the blanks of the query template. The result will be a Graql insert query.

We need 4 of them. Let’s go through them one by one.

company_template

Example:

  • Goes in: { name: "Telecom" }
  • Comes out: insert $company isa company has name "Telecom";
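A template that produces this output can be as small as one string concatenation. The sketch below is one possible way to write it, assuming the dictionary key is name as in the example.

```python
def company_template(company):
    # Fill the company's name into the Graql insert query.
    return 'insert $company isa company has name "' + company["name"] + '";'


print(company_template({"name": "Telecom"}))
# insert $company isa company has name "Telecom";
```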

person_template

Example:

  • Goes in: { phone_number: "+44 091 xxx" }
  • Comes out: insert $person isa person has phone-number "+44 091 xxx";

or:

  • Goes in: { first_name: "Jackie", last_name: "Joe", city: "Jimo", age: 77, phone_number: "+00 091 xxx" }
  • Comes out: insert $person isa person has phone-number "+00 091 xxx" has first-name "Jackie" has last-name "Joe" has city "Jimo" has age 77;
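Both cases can be handled by one function that always inserts the phone number and adds the remaining attributes only when they are present. The sketch below assumes underscore-style keys (first_name, last_name, city, age, phone_number) and writes isa person explicitly, because the queries later in this post match on $x isa person.

```python
def person_template(person):
    # Every person has at least a phone number.
    graql_insert_query = (
        'insert $person isa person has phone-number "' + person["phone_number"] + '"'
    )
    if "first_name" in person:
        # Personal details are only available for some people.
        graql_insert_query += ' has first-name "' + person["first_name"] + '"'
        graql_insert_query += ' has last-name "' + person["last_name"] + '"'
        graql_insert_query += ' has city "' + person["city"] + '"'
        graql_insert_query += " has age " + str(person["age"])
    graql_insert_query += ";"
    return graql_insert_query


print(person_template({"phone_number": "+44 091 xxx"}))
# insert $person isa person has phone-number "+44 091 xxx";
```

One of the queries later in this post filters on has is-customer false. If your schema defines that attribute, the template would also need to set it, which this sketch leaves out.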

contract_template

Example:

  • Goes in: { company_name: "Telecom", person_id: "+00 091 xxx" }
  • Comes out: match $company isa company has name "Telecom"; $customer isa person has phone-number "+00 091 xxx"; insert (provider: $company, customer: $customer) isa contract;
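Unlike the first two templates, this one first has to match the company and the person that already exist, and then insert the relationship between them. A possible sketch, assuming the keys company_name and person_id from the example:

```python
def contract_template(contract):
    # Find the company by name.
    graql_insert_query = (
        'match $company isa company has name "' + contract["company_name"] + '";'
    )
    # Find the customer by phone number.
    graql_insert_query += (
        ' $customer isa person has phone-number "' + contract["person_id"] + '";'
    )
    # Insert the contract relationship between the two.
    graql_insert_query += " insert (provider: $company, customer: $customer) isa contract;"
    return graql_insert_query


print(contract_template({"company_name": "Telecom", "person_id": "+00 091 xxx"}))
```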

call_template

Example:

  • Goes in: { caller_id: "+44 091 xxx", callee_id: "+00 091 xxx", started_at: 2018-08-10T07:57:51, duration: 148 }
  • Comes out: match $caller isa person has phone-number "+44 091 xxx"; $callee isa person has phone-number "+00 091 xxx"; insert $call(caller: $caller, callee: $callee) isa call; $call has started-at 2018-08-10T07:57:51; $call has duration 148;
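The call template follows the same match-then-insert shape, with two attributes attached to the new relationship. In the sketch below, started_at is assumed to arrive as a string with plain hyphens (for example 2018-08-10T07:57:51) and is written unquoted, as in the example output; duration may be a number or a numeric string.

```python
def call_template(call):
    # Find the caller and the callee by their phone numbers.
    graql_insert_query = (
        'match $caller isa person has phone-number "' + call["caller_id"] + '";'
    )
    graql_insert_query += (
        ' $callee isa person has phone-number "' + call["callee_id"] + '";'
    )
    # Insert the call relationship, then attach its attributes.
    graql_insert_query += " insert $call(caller: $caller, callee: $callee) isa call;"
    graql_insert_query += " $call has started-at " + call["started_at"] + ";"
    graql_insert_query += " $call has duration " + str(call["duration"]) + ";"
    return graql_insert_query


print(call_template({
    "caller_id": "+44 091 xxx",
    "callee_id": "+00 091 xxx",
    "started_at": "2018-08-10T07:57:51",
    "duration": 148,
}))
```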

We’ve now created a template for each of the four concepts that were previously defined in the schema.

It’s time for the implementation of parse_data_to_dictionaries(input).

DataFormat-specific implementation

The implementation of parse_data_to_dictionaries(input) differs based on the format of our data files: .csv, .json or .xml.

Parsing CSV

We will use Python’s built-in csv library. Let’s import the module for it.

Moving on, we will write the implementation of parse_data_to_dictionaries(input) for parsing .csv files. Note that we use DictReader to map the information in each row to a dictionary.

Besides this function, we need to make one more change.

Given the nature of CSV files, the dictionary produced will have all the columns of the .csv file as its keys. Even when a value is missing, the key is still there, with a blank string as its value.

For this reason, we need to change one line in our person_template function.

The check if "first_name" in person becomes a comparison against the empty string, person["first_name"] == "", which is true when the value is missing.
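To see why the check has to change, here is a small sketch that runs DictReader over two made-up rows held in memory (the column order is an assumption, not taken from the real people file). The second row has only a phone number, yet its dictionary still contains a first_name key.

```python
import csv
import io

# Made-up sample: the second person has no personal details.
SAMPLE = (
    "first_name,last_name,phone_number,city,age\n"
    "Jackie,Joe,+00 091 xxx,Jimo,77\n"
    ",,+44 091 xxx,,\n"
)

rows = [dict(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print(rows[1])  # every column is a key; missing values are empty strings


def person_template(person):
    graql_insert_query = (
        'insert $person isa person has phone-number "' + person["phone_number"] + '"'
    )
    if person["first_name"] == "":
        # Blank first name: the phone number is all we know.
        return graql_insert_query + ";"
    graql_insert_query += ' has first-name "' + person["first_name"] + '"'
    graql_insert_query += ' has last-name "' + person["last_name"] + '"'
    graql_insert_query += ' has city "' + person["city"] + '"'
    graql_insert_query += " has age " + person["age"]
    return graql_insert_query + ";"
```

Note that every value read from a CSV file is a string, including age, so no conversion is needed before concatenating it into the query.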

Parsing JSON

We will use ijson, an iterative JSON parser with a standard Python iterator interface.

Via the terminal, while in the phone_calls directory, run pip3 install ijson and import the module for it.

Moving on, we will write the implementation of parse_data_to_dictionaries(input) for processing .json files.
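A possible shape for the JSON version is sketched below. It assumes that each data file holds one top-level JSON array of objects and that data_path is the file path without its extension. ijson.items(file, "item") yields the elements of a top-level array one at a time. Because ijson is a third-party package, this sketch falls back to the standard json module (which reads the whole file at once) when ijson is not installed; that fallback is a convenience added here, not part of the original tutorial.

```python
import json

try:
    import ijson  # third-party: pip3 install ijson
except ImportError:
    ijson = None


def parse_data_to_dictionaries(input):
    # Binary mode works for both ijson and json.load.
    with open(input["data_path"] + ".json", "rb") as data:
        if ijson is not None:
            # "item" selects each element of the top-level array.
            return list(ijson.items(data, "item"))
        # Fallback: loads the entire file into memory.
        return json.load(data)
```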

Parsing XML

We will use Python’s built-in xml.etree.cElementTree library (removed in Python 3.9; on newer versions, xml.etree.ElementTree provides the same interface). Let’s import the module for it.

For parsing XML data, we need to know the target tag name. This needs to be specified for each data file in our inputs declaration.

And now for the implementation of parse_data_to_dictionaries(input) for parsing .xml files.

The implementation below, although not the most generic, performs well with very large .xml files. Note that many libraries that parse XML into dictionaries pull the entire .xml file into memory first. There is nothing wrong with that approach when you’re dealing with small files, but when it comes to large files, that’s just a no-go.
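As a rough illustration of the streaming approach, the sketch below uses iterparse, which reports each element as soon as its closing tag has been read, so the file never has to be held in memory as a whole. It is not the original implementation: it uses xml.etree.ElementTree rather than cElementTree, assumes that every item is an element whose direct children each hold one value, and assumes the target tag name is stored under a selector key in the input dictionary (that key name is made up here).

```python
import xml.etree.ElementTree as et


def parse_data_to_dictionaries(input):
    items = []
    # "end" events fire once an element and all of its children are complete.
    for event, element in et.iterparse(input["data_path"] + ".xml", events=("end",)):
        if element.tag == input["selector"]:
            # Child tag names become keys, their text becomes the values.
            items.append({child.tag: child.text for child in element})
            # Drop the children we no longer need to keep memory use low.
            element.clear()
    return items


# Example input declaration for one file (hypothetical "selector" key):
# {"data_path": "data/companies", "template": company_template, "selector": "company"}
```

A child element with no text yields None as its value, so a template would need to treat None the way the CSV version treats a blank string.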

Putting it all together

Here is what our migrate.py looks like for loading CSV data into Grakn, and here are the ones for JSON and XML files.

Time to Load

Run python3 migrate.py

Sit back, relax and watch the logs while the data starts pouring into Grakn.

… so far with the migration

We started off by setting up our project and positioning the data files.

Next we went on to set up the migration mechanism, one that was independent of the data format.

Then, we went ahead and wrote the template functions whose only job was to construct a Graql insert query based on the data passed to them.

After that, we learned how files with different data formats can be parsed into Python dictionaries.

Lastly, we ran python3 migrate.py which fired the build_phone_call_graph function with the given inputs. This loaded the data into our Grakn knowledge graph.

Next: Query the Knowledge Graph

When we modelled and loaded the schema into Grakn, we had some insights in mind that we wanted to obtain from the phone_calls knowledge graph.

Let’s recap:

  • Since September 14th, which customers called person X?
  • Who are the people who have received a call from a London customer aged over 50 who has previously called someone aged under 20?
  • Who are the common contacts of customers X and Y?
  • Who are the customers who 1) have all called each other and 2) have all called person X at least once?
  • How does the average call duration among customers aged under 20 compare with those aged over 40?

For the rest of this post, we will go through each of these questions to:

  • understand their business value,
  • write them as a statement,
  • write them in Graql, and
  • assess their result.

Make sure you have Workbase open, with phone_calls selected as the keyspace (in the top right-hand corner).

Let’s begin.

Since September 14th, which customers called person X?

The business value:

The person with phone number +86 921 547 9004 has been identified as a lead. We (company "Telecom") would like to know which of our customers have been in contact with this person since September 14th. This will help us in converting this lead into a customer.

As a statement:

Get me the customers of company “Telecom” who called the target person with phone number +86 921 547 9004 from September 14th onwards.

In Graql:

match
$customer isa person has phone-number $phone-number;
$company isa company has name "Telecom";
(customer: $customer, provider: $company) isa contract;
$target isa person has phone-number "+86 921 547 9004";
(caller: $customer, callee: $target) isa call has started-at
$started-at;
$min-date == 2018-09-14T17:18:49; $started-at > $min-date;
get $phone-number;

The result:

[ '+62 107 530 7500', '+370 351 224 5176', '+54 398 559 0423', 
'+7 690 597 4443', '+263 498 495 0617', '+63 815 962 6097',
'+81 308 988 7153', '+81 746 154 2598']

Try it yourself

Using Workbase
USING THE GRAQL CONSOLE

The Graql Console is used to execute Graql queries from the command line, or to let Graql be invoked from other applications.

USING THE PYTHON CLIENT
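A minimal sketch of running this query from Python is shown below. It assumes the client API of the Grakn Python client from around the time of writing: a read transaction opened with session.transaction(grakn.TxType.READ), tx.query(...) returning an iterator with a collect_concepts() helper, and value() on each returned attribute. The function name run_get_query and its parameters are made up for illustration; check the client documentation for the version you have installed.

```python
QUERY = (
    "match "
    "$customer isa person has phone-number $phone-number; "
    '$company isa company has name "Telecom"; '
    "(customer: $customer, provider: $company) isa contract; "
    '$target isa person has phone-number "+86 921 547 9004"; '
    "(caller: $customer, callee: $target) isa call has started-at $started-at; "
    "$min-date == 2018-09-14T17:18:49; $started-at > $min-date; "
    "get $phone-number;"
)


def run_get_query(query, keyspace="phone_calls", uri="localhost:48555"):
    # Imported here so the sketch can be loaded without the client installed.
    import grakn

    client = grakn.Grakn(uri=uri)
    with client.session(keyspace=keyspace) as session:
        # Reading only, so a READ transaction is enough and nothing is committed.
        with session.transaction(grakn.TxType.READ) as tx:
            answers = tx.query(query).collect_concepts()
            return [answer.value() for answer in answers]


# With the Grakn server running and the data loaded:
# print(run_get_query(QUERY))
```

The same function can be reused for the other get queries in this post by passing a different query string.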

Who are the people who have received a call from a London customer aged over 50 who has previously called someone aged under 20?

The business value:

We (company "Telecom") have received a number of harassment reports, which we suspect are caused by one individual. The only thing we know about the harasser is that he/she is aged over 50 and lives in London. The reports have been made by young adults, all aged under 20. We wonder if there is a pattern, and so would like to speak to anyone who has received a call from a suspect since he/she potentially started harassing.

As a statement:

Get me the phone number of people who have received a call from a London customer aged over 50 after this customer (suspect) made a call to another customer aged under 20.

In Graql:

match
$suspect isa person has city "London", has age > 50;
$company isa company has name "Telecom";
(customer: $suspect, provider: $company) isa contract;
$pattern-callee isa person has age < 20;
(caller: $suspect, callee: $pattern-callee) isa call
has started-at $pattern-call-date;
$target isa person has phone-number $phone-number,
has is-customer false;
(caller: $suspect, callee: $target) isa call
has started-at $target-call-date;
$target-call-date > $pattern-call-date;
get $phone-number;

The result:

[ '+30 419 575 7546',  '+86 892 682 0628', '+1 254 875 4647', 
'+351 272 414 6570', '+33 614 339 0298', '+86 922 760 0418',
'+86 825 153 5518', '+48 894 777 5173', '+351 515 605 7915',
'+63 808 497 1769', '+27 117 258 4149', '+86 202 257 8619' ]

Try it yourself

Using Workbase
USING THE GRAQL CONSOLE
USING THE PYTHON CLIENT

Who are the common contacts of customers X and Y?

The business value:

The customers with phone numbers +7 171 898 0853 and +370 351 224 5176 have been identified as friends. We (company "Telecom") would like to know who their common contacts are in order to offer them a group promotion.

As a statement:

Get me the phone number of people who have received calls from both customer with phone number +7 171 898 0853 and customer with phone number +370 351 224 5176.

In Graql:

match
$common-contact isa person has phone-number $phone-number;
$customer-a isa person has phone-number "+7 171 898 0853";
$customer-b isa person has phone-number "+370 351 224 5176";
(caller: $customer-a, callee: $common-contact) isa call;
(caller: $customer-b, callee: $common-contact) isa call;
get $phone-number;

The result:

['+86 892 682 0628', '+54 398 559 0423']

Try it yourself

Using Workbase
USING THE GRAQL CONSOLE
USING THE PYTHON CLIENT

Who are the customers who 1) have all called each other and 2) have all called person X at least once?

The business value:

The person with phone number +48 894 777 5173 has been identified as a lead. We (company "Telecom") would like to know who is in his circle of (customer) contacts, so that we can encourage them to help convert this lead into a customer.

As a statement:

Get me the phone numbers of all customers who have called each other as well as the person with phone number +48 894 777 5173.

In Graql:

match
$target isa person has phone-number "+48 894 777 5173";
$company isa company has name "Telecom";
$customer-a isa person has phone-number $phone-number-a;
$customer-b isa person has phone-number $phone-number-b;
(customer: $customer-a, provider: $company) isa contract;
(customer: $customer-b, provider: $company) isa contract;
(caller: $customer-a, callee: $customer-b) isa call;
(caller: $customer-a, callee: $target) isa call;
(caller: $customer-b, callee: $target) isa call;
get $phone-number-a, $phone-number-b;

The result:

[ '+62 107 530 7500', '+261 860 539 4754', '+81 308 988 7153' ]

Try it yourself

USING THE GRAQL CONSOLE
USING THE PYTHON CLIENT

How does the average call duration among customers aged under 20 compare with those aged over 40?

The business value:

In order to better understand our customers' behaviour, we (company "Telecom") would like to know how the average phone call duration among those aged under 20 compares to those aged over 40.

Two queries need to be executed to provide this insight.

Query 1: aged under 20

As a statement:

Get me the average call duration among customers who have a contract with company "Telecom" and are aged under 20.

In Graql:

match 
$customer isa person has age < 20;
$company isa company has name "Telecom";
(customer: $customer, provider: $company) isa contract;
(caller: $customer, callee: $anyone) isa call has duration
$duration;
aggregate mean $duration;

The result:

1348 seconds

Query 2: aged over 40

As a statement:

Get me the average call duration among customers who have a contract with company "Telecom" and are aged over 40.

In Graql:

match 
$customer isa person has age > 40;
$company isa company has name "Telecom";
(customer: $customer, provider: $company) isa contract;
(caller: $customer, callee: $anyone) isa call has duration
$duration;
aggregate mean $duration;

The result:

1587 seconds
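To answer the original question, the two results still have to be compared. The short sketch below does that arithmetic with the two figures reported above; the variable names are only illustrative.

```python
mean_under_20 = 1348  # seconds, result of query 1
mean_over_40 = 1587   # seconds, result of query 2

# Absolute and relative difference between the two averages.
difference = mean_over_40 - mean_under_20
percent_longer = round((mean_over_40 / mean_under_20 - 1) * 100, 1)

print("Difference: " + str(difference) + " seconds")
print("Customers over 40 talk about " + str(percent_longer) + "% longer on average")
```

On these figures, calls made by customers aged over 40 last 239 seconds (roughly four minutes) longer on average, which is about 17.7% more than those made by customers aged under 20.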

Try it yourself

USING THE GRAQL CONSOLE
USING THE PYTHON CLIENT

👏 You’ve done it!

Five Graql queries, each written in a few lines, answered all of our questions. Our imaginary client, Telecom, can now take these insights back to their team and, hopefully, use them responsibly to serve their customers. You are the one who made it happen!

🚀 There is (always) more!

This should by no means be the end of our phone_calls Grakn knowledge graph. Try the interactive query executor, play around with it and write your own Graql queries to gain more interesting insights over this dataset.

Model the schema for your own brand new knowledge graph! Head to the Grakn Documentation and learn more about the cool things you can do with Grakn.

Talk to us, discuss your ideas and contribute to Grakn!

Stay tuned for more Grakn examples 😉
