Loading data into a Grakn Knowledge Graph using the Java client
This tutorial may be out of date against the latest version of Grakn. For the most up-to-date version of this tutorial, please refer to the Grakn Documentation.
This tutorial illustrates how a dataset in CSV, JSON or XML format can be migrated into a Grakn knowledge graph, using Grakn’s Java Client.
The knowledge graph that we’ll work on in this post is called phone_calls.
The schema for this knowledge graph was defined in a previous post, here.
If you’re already familiar with Grakn, and all you need is a migration example to follow, you’ll find this Github repository useful. If, on the other hand, you’re not familiar with the technology, make sure to first complete defining the schema for the phone_calls knowledge graph, and read on for a detailed guide on migrating data into Grakn using Java.
A Quick Look into the phone_calls Schema
Before we get started with migration, let’s have a quick reminder of what the schema for the phone_calls knowledge graph looks like.
Migrate Data into Grakn
Let’s go through an overview of how the migration takes place.
- First, we need to talk to our Grakn keyspace. To do this, we’ll use Grakn’s Java Client.
- We’ll go through each data file, extracting each data item and parsing it to a JSON object.
- We’ll pass each data item (in the form of a JSON object) to its corresponding template. What the template returns is a Graql query for inserting that item into Grakn.
- We’ll execute each of those queries to load the data into our target keyspace, phone_calls.
Before moving on, make sure you have Java 1.8 installed and the Grakn server running on your machine.
Getting Started
Create a new Maven project
This project uses SDK 1.8 and is named phone_calls. I’ll be using IntelliJ as the IDE.
Set Grakn as a dependency
Modify pom.xml to include the latest version of Grakn (1.4.2) as a dependency.
Configure logging
We’d like to be able to configure what Grakn logs out. To do this, modify pom.xml to exclude the slf4j binding shipped with grakn and add logback as a dependency instead.
Next, add a new file called logback.xml with the content below and place it under src/main/resources.
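As a reference point, a minimal logback.xml that prints INFO-level output to the console might look like this (the pattern and level here are just starting points to adjust):

```xml
<configuration>
    <!-- send all log output to the console -->
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    <!-- INFO keeps the migration progress visible without debug noise -->
    <root level="INFO">
        <appender-ref ref="STDOUT"/>
    </root>
</configuration>
```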
Create the Migration Class
Under src/main, create a new file called Migration.java. This is where we’re going to write all our code.
Including the Data Files
Pick one of the data formats below and download the files. After you download each of the four files, place them under the src/main/resources/data directory. We’ll be using these to load their data into our phone_calls knowledge graph.
CSV: companies | people | contracts | calls
JSON: companies | people | contracts | calls
XML: companies | people | contracts | calls
All code that follows is to be written in Migration.java.
Specifying details for each data file
Before anything, we need a structure to contain the details required for reading data files and constructing Graql queries. These details include:
- The path to the data file, and
- The template function that receives a JSON object and produces a Graql insert query.
For this purpose, we create a new abstract class called Input.
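A sketch of what this class can look like follows. To keep the sketch self-contained, a plain java.util.Map stands in for the mjson Json object that the tutorial passes to template():

```java
import java.util.Map;

// Each Input instance pairs a data-file path with a template function.
// Concrete subclasses supply the template; the migration loop only needs
// these two pieces of information per data file.
abstract class Input {
    private final String path;

    Input(String path) {
        this.path = path;
    }

    public String getDataPath() {
        return path;
    }

    // Receives one data item and returns a Graql insert query for it.
    public abstract String template(Map<String, Object> item);
}
```

Each concrete Input supplies its own template; later, initialiseInputs() will instantiate one per data file.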
Later in this article, we’ll see how an instance of the Input class can be created, but before we get to that, let’s add the mjson dependency to the dependencies tag in our pom.xml file.
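The mjson artifact is published on Maven Central under the org.sharegov group, so the entry inside the dependencies tag looks along these lines (the version shown is illustrative; check Maven Central for the latest):

```xml
<!-- mjson: the small JSON library used to represent each data item -->
<dependency>
    <groupId>org.sharegov</groupId>
    <artifactId>mjson</artifactId>
    <version>1.4.0</version>
</dependency>
```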
Time to initialise the inputs.
The code below calls the initialiseInputs() method, which returns a collection of inputs. We’ll then use each input element in this collection to load each data file into Grakn.
Input instance for a Company
input.getDataPath() will return data/companies.
Given company is
{ name: "Telecom" }
input.template(company) will return
insert $company isa company has name "Telecom";
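A sketch of this template as a standalone method (a plain Map stands in for the mjson Json object, and the class name here is just for illustration):

```java
import java.util.Map;

// Builds the Graql insert query for one company data item.
class CompanyTemplate {
    static String template(Map<String, Object> company) {
        return "insert $company isa company has name \"" + company.get("name") + "\";";
    }
}
```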
Input instance for a Person
input.getDataPath() will return data/people.
Given person is
{ phone_number: "+44 091 xxx" }
input.template(person) will return
insert $person isa person has phone-number "+44 091 xxx";
And given person is
{ first_name: "Jackie", last_name: "Joe", city: "Jimo", age: 77, phone_number: "+00 091 xxx" }
input.template(person) will return
insert $person isa person has phone-number "+00 091 xxx" has first-name "Jackie" has last-name "Joe" has city "Jimo" has age 77;
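A sketch of this template as a standalone method: the phone number is always present, while the remaining attributes are appended only when the data item carries them (a plain Map stands in for the mjson Json object; the class name is illustrative):

```java
import java.util.Map;

// Builds the Graql insert query for one person data item,
// handling both the phone-number-only and the fully-detailed case.
class PersonTemplate {
    static String template(Map<String, Object> person) {
        // isa person is needed so that a new entity gets inserted
        StringBuilder query = new StringBuilder("insert $person isa person has phone-number \"")
                .append(person.get("phone_number")).append("\"");
        if (person.containsKey("first_name")) {
            query.append(" has first-name \"").append(person.get("first_name")).append("\"")
                 .append(" has last-name \"").append(person.get("last_name")).append("\"")
                 .append(" has city \"").append(person.get("city")).append("\"")
                 .append(" has age ").append(person.get("age"));
        }
        return query.append(";").toString();
    }
}
```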
Input instance for a Contract
input.getDataPath() will return data/contracts.
Given contract is
{ company_name: "Telecom", person_id: "+00 091 xxx" }
input.template(contract) will return
match $company isa company has name "Telecom"; $customer isa person has phone-number "+00 091 xxx"; insert (provider: $company, customer: $customer) isa contract;
Input instance for a Call
input.getDataPath() will return data/calls.
Given call is
{ caller_id: "+44 091 xxx", callee_id: "+00 091 xxx", started_at: 2018-08-10T07:57:51, duration: 148 }
input.template(call) will return
match $caller isa person has phone-number "+44 091 xxx"; $callee isa person has phone-number "+00 091 xxx"; insert $call(caller: $caller, callee: $callee) isa call; $call has started-at 2018-08-10T07:57:51; $call has duration 148;
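A sketch of this match-insert pattern as a standalone method (the contract template follows the same shape). The match clause looks up the caller and callee by phone number; the insert clause then creates the call relationship between them. As before, a plain Map stands in for the mjson Json object:

```java
import java.util.Map;

// Builds the Graql match-insert query for one call data item.
class CallTemplate {
    static String template(Map<String, Object> call) {
        return "match $caller isa person has phone-number \"" + call.get("caller_id") + "\"; "
             + "$callee isa person has phone-number \"" + call.get("callee_id") + "\"; "
             + "insert $call(caller: $caller, callee: $callee) isa call; "
             + "$call has started-at " + call.get("started_at") + "; "
             + "$call has duration " + call.get("duration") + ";";
    }
}
```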
Connect and Migrate
Now that we have the datapath and template defined for each of our data files, we can continue to connect with our phone_calls knowledge graph and load the data into it.
connectAndMigrate(Collection<Input> inputs) is the only method that will be fired to initiate migration of the data into the phone_calls knowledge graph.
The following happens in this method:
- A Grakn instance grakn is created, connected to the server we have running locally at localhost:48555.
- A session is created, connected to the keyspace phone_calls.
- For each input object in the inputs collection, we call loadDataIntoGrakn(input, session). This will take care of loading the data as specified in the input object into our keyspace.
- Finally, the session is closed.
Loading the data into phone_calls
Now that we have a session connected to the phone_calls keyspace, we can move on to actually loading the data into our knowledge graph.
In order to load data from each file into Grakn, we need to:
- retrieve an ArrayList of JSON objects, each of which represents a data item. We do this by calling parseDataToJson(input), and
- for each JSON object in items: a) create a transaction tx, b) construct the graqlInsertQuery using the corresponding template, c) run the query, d) commit the transaction, and e) close the transaction.
Note on creating and committing transactions: To avoid running out of memory, it’s recommended that every single query gets created and committed in a single transaction. However, for faster migration of large datasets, this can happen once for every n queries, where n is the maximum number of queries guaranteed to run on a single transaction.
Now that we’ve done all the above, we’re ready to read each file and parse each data item to a JSON object. It’s these JSON objects that will be passed to the template method on each Input object.
We’re going to write the implementation of parseDataToJson(input).
DataFormat-specific implementation
The implementation for parseDataToJson(input) differs based on the format of our data files: .csv, .json or .xml.
But regardless of the data format, we need the right setup to read the files line by line. For this, we’ll use an InputStreamReader.
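The shared file-reading setup can be sketched as follows. The helper below simply collects lines, which is enough to show the shape each format-specific parser builds on (the class and method names here are just for illustration):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Wraps the data file's stream in an InputStreamReader and reads it line by line.
class FileLines {
    static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```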
Parsing CSV
We’ll use the Univocity CSV Parser for parsing our .csv files. Let’s add the dependency for it. We need to add the following to the dependencies tag in pom.xml.
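The Univocity parsers are published under the com.univocity group, so the entry is along these lines (the version shown is illustrative; check Maven Central for the latest):

```xml
<dependency>
    <groupId>com.univocity</groupId>
    <artifactId>univocity-parsers</artifactId>
    <version>2.7.6</version>
</dependency>
```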
Having done that, we’ll write the implementation of parseDataToJson(input) for parsing .csv files.
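The flow of that implementation can be sketched without the Univocity dependency: the first row supplies the keys, and every following row becomes one data item. The naive comma split below ignores quoting and escaping, which Univocity handles properly, and a Map stands in for the mjson Json object:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Turns CSV lines into a list of data items keyed by the header row.
class CsvToItems {
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> items = new ArrayList<>();
        String[] columns = lines.get(0).split(",");
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(",");
            Map<String, String> item = new HashMap<>();
            for (int i = 0; i < columns.length; i++) {
                // a missing trailing value is stored as null
                item.put(columns[i], i < values.length ? values[i] : null);
            }
            items.add(item);
        }
        return items;
    }
}
```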
Besides this implementation, we need to make one more change.
Given the nature of CSV files, the JSON object produced will have all the columns of the .csv file as its keys; even when a value is missing, the key is present and its value is taken as null.
For this reason, we need to change one line in the template method for the input instance for person:
if (! person.has("first_name")) {...}
becomes
if (person.at("first_name").isNull()) {...}.
Reading JSON
We’ll use Gson’s JsonReader for reading our .json files. Let’s add the dependency for it. We need to add the following to the dependencies tag in pom.xml.
Having done that, we’ll write the implementation of parseDataToJson(input) for reading .json files.
Parsing XML
We’ll use Java’s built-in StAX for parsing our .xml files.
For parsing XML data, we need to know the name of the target tag. This needs to be declared in the Input class and specified when constructing each input object.
And now for the implementation of parseDataToJson(input) for parsing .xml files.
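Since StAX ships with the JDK, the core of that implementation can be sketched self-containedly. Every element nested inside the target tag (for example, each child of a <company> element) becomes one key/value of the current data item; a Map stands in for the mjson Json object and the class name is illustrative:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Streams through the XML and collects one data item per target tag.
class XmlToItems {
    static List<Map<String, String>> parse(String xml, String targetTag) throws XMLStreamException {
        List<Map<String, String>> items = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        Map<String, String> item = null;  // the data item currently being built
        String field = null;              // the child element currently being read
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (r.getLocalName().equals(targetTag)) item = new HashMap<>();
                    else if (item != null) field = r.getLocalName();
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (field != null && item != null && !r.isWhiteSpace()) item.put(field, r.getText().trim());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (r.getLocalName().equals(targetTag)) { items.add(item); item = null; }
                    field = null;
                    break;
            }
        }
        return items;
    }
}
```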
Putting it all together
Here is how our Migration.java looks for loading CSV data into Grakn, and find here the ones for JSON and XML files.
Time to Load
Run the main
method, sit back, relax and watch the logs while the data starts pouring into Grakn.
To Recap
- We started off by setting up our project and positioning the data files.
- Next, we went on to set up the migration mechanism, one that was independent of the data format.
- Then, we learned how files with different data formats can be parsed into JSON objects.
- Lastly, we ran the main method, which fired the connectAndMigrate method with the given inputs. This loaded the data into our Grakn knowledge graph.
Next
In the next post (to be published soon), we’ll see how we can get insights over this dataset by querying the phone_calls knowledge graph using the Graql console, Grakn Workbase and the Java client. Stay tuned!