The fastest way to convert your data to RDF

Richard Loveday
Published in DataLens
10 min read · Oct 1, 2020

Creating bespoke parsers to convert data into RDF is an expensive and time-consuming process. Data-Lens have created a suite of products that utilise RML, a superset of R2RML, to convert your data, whatever its type, into RDF. There are five products, or Lenses, for different use cases. All five Lenses can be found on the Data-Lens AWS Marketplace home page and are as follows:

1. The Structured File Lens. Ingest structured files with ease. Supported types are CSV, JSON and XML. Full user documentation is available here

2. The SQL Lens. Ingest data from SQL Databases with ease. All popular SQL Databases are supported (JDBC connection). Full user documentation is available here

3. The RESTful Lens. Configure this lens to fetch data from any RESTful endpoint. The JSON:API specification is also supported. Full user documentation is available here

4. The Document Lens. Using AI technologies, the Document Lens will analyse and tag your documents to other data in your Knowledge Graph. Supported types are .docx .pdf and .txt. Full user documentation is available here

5. The Lens-Writer. Takes the RDF output from any of the other lenses and can write it to the Knowledge graph of your choice. Full user documentation is available here

In this blog, I’m going to show how easy it is to convert data to RDF using one of the suite of Data-Lens products. I’ll use the most straightforward of our Lenses, the Structured File Lens, which can convert CSV, JSON, or XML files to RDF. The Lenses are available as Docker images to be used wherever you wish, but in this blog I will show you how to obtain the Lens as a product from the AWS Marketplace and run it within AWS.

To begin with, we’ll take some sample data I have from IMDB: about 100 film titles in JSON format. The full file can be obtained here, and the data about the titles looks like this:

[
  {
    "tconst": "tt0000001",
    "titleType": "short",
    "primaryTitle": "Carmencita",
    "originalTitle": "Carmencita",
    "isAdult": "0",
    "startYear": "1894",
    "endYear": "",
    "runtimeMinutes": "1",
    "genres": "Documentary,Short"
  },
  {
    "tconst": "tt0000002",
    "titleType": "short",
    "primaryTitle": "Le clown et ses chiens",
    "originalTitle": "Le clown et ses chiens",
    "isAdult": "0",
    "startYear": "1892",
    "endYear": "",
    "runtimeMinutes": "5",
    "genres": "Animation,Short"
  },

To help visualise the mapping we will create between the source JSON and the target RDF, we recommend that you draw an ontology. This gives you a document/diagram describing your RDF model that you can discuss and share. The source data here is relatively simple, consisting of only one object class, IMDB:Title, with predicates linking to several datatype properties. I created this ontology diagram in Lucidchart:

Once we know how our data is going to be modelled, we can move on to the next part of the process: the creation of a mapping file.

The mapping files are written using the RML language. RML is defined as a superset of R2RML, the W3C-recommended mapping language for mapping data in relational databases to RDF. RML mappings are themselves RDF graphs, written down in Turtle syntax. The full detail of how to create a mapping file is beyond the scope of this blog, but detailed documentation can be found here: how to create a mapping file. I will just point out the key elements of the file:

  1. The logical source
rml:logicalSource [
    rml:source "inputSourceFile.json";
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.[*]";
];

This consists of:

  • A reference to the input source. The Structured File Lens copies the source data file to a local file named inputSourceFile (N.B. because of this, the file name in the mapping should never be changed), with the suffix of whatever data type is being converted.
  • The reference formulation, which specifies how to refer to the data. Here we are using JSONPath, currently the only option available when converting JSON.
  • The iterator, which specifies how to iterate over the data. "$.[*]" iterates over every object within the initial JSON array.
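To make the logical source concrete, here is a rough Python sketch (not the Lens’s implementation, and using a trimmed-down two-record sample) of what the ql:JSONPath iterator "$.[*]" amounts to for this file:

```python
import json

# A rough sketch of what the logical source describes: read the file named
# in rml:source and apply the iterator "$.[*]", which yields each object
# in the top-level JSON array.
sample = """
[
  {"tconst": "tt0000001", "primaryTitle": "Carmencita"},
  {"tconst": "tt0000002", "primaryTitle": "Le clown et ses chiens"}
]
"""

records = json.loads(sample)  # the top-level JSON array from inputSourceFile.json
for record in records:        # one iteration per object, as "$.[*]" specifies
    print(record["tconst"])
```

Each iteration hands one object to the subject and predicate-object maps described next.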

2. The subject map.

rr:subjectMap [
    rr:termType rr:IRI;
    rr:template "http://imdb.com/title/{tconst}";
    rr:class :Title;
];

The subject map generates the subject of all RDF triples that will be generated from a data element. The subjects are often IRIs generated from the primary key or ID portions of the data. Here we are using the tconst value in the source data:

{
  "tconst": "tt0000001",
  "titleType": "short",
  "primaryTitle": "Carmencita",
  "originalTitle": "Carmencita",
  "isAdult": "0",
  "startYear": "1894",
  "endYear": "",
  "runtimeMinutes": "1",
  "genres": "Documentary,Short"
},

This would create the subject <http://imdb.com/title/tt0000001> in the resulting triples.

3. Multiple predicate-object maps

rr:predicateObjectMap [
    rr:predicate "http://imdb.com/primaryTitle";
    rr:objectMap [
        rr:template "{primaryTitle}";
        rr:termType rr:Literal;
        rr:datatype xsd:string;
    ]
];

This consists of the predicate, which here is "http://imdb.com/primaryTitle", and the objectMap, which is made up of:

  • rr:template. This refers to the JSON key “primaryTitle” whose value we are taking for the object.
  • rr:termType. This determines the kind of generated RDF term/object, in this case, a Literal.
  • rr:datatype. This determines the data type of the generated RDF term/object, in this case, a String.

Here is the relevant JSON key and value ("primaryTitle": "Carmencita") in the source data below:

{
  "tconst": "tt0000001",
  "titleType": "short",
  "primaryTitle": "Carmencita",
  "originalTitle": "Carmencita",
  "isAdult": "0",
  "startYear": "1894",
  "endYear": "",
  "runtimeMinutes": "1",
  "genres": "Documentary,Short"
},

From this data, the example predicate-object map creates the predicate <http://imdb.com/primaryTitle> and the object "Carmencita" in the resulting triple.
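Putting the subject map and this predicate-object map together, the expansion for one record can be sketched in Python (plain string substitution; a simplification of what the Lens actually does):

```python
# One record from the iterator, trimmed to the keys used here.
record = {
    "tconst": "tt0000001",
    "primaryTitle": "Carmencita",
}

# rr:template on the subject map: substitute {tconst} into the IRI template.
subject = "<http://imdb.com/title/{tconst}>".format(**record)

# rr:predicate is a fixed IRI; rr:template on the object map takes the value
# of the "primaryTitle" key, and rr:datatype types it as an xsd:string literal.
predicate = "<http://imdb.com/primaryTitle>"
obj = '"{primaryTitle}"'.format(**record)

triple = f"{subject} {predicate} {obj} ."
print(triple)
# → <http://imdb.com/title/tt0000001> <http://imdb.com/primaryTitle> "Carmencita" .
```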

Once we add a predicate-object map for each of the remaining predicates in the model, we end up with the following mapping file:

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://imdb.com/> .
@base <http://imdb.com/> .

<TriplesMap1>
    a rr:TriplesMap;
    rml:logicalSource [
        rml:source "inputSourceFile.json";
        rml:referenceFormulation ql:JSONPath;
        rml:iterator "$.[*]";
    ];
    rr:subjectMap [
        rr:termType rr:IRI;
        rr:template "http://imdb.com/title/{tconst}";
        rr:class :Title;
    ];
    rr:predicateObjectMap [
        rr:predicate "http://imdb.com/primaryTitle";
        rr:objectMap [
            rr:template "{primaryTitle}";
            rr:termType rr:Literal;
            rr:datatype xsd:string;
        ]
    ];
    rr:predicateObjectMap [
        rr:predicate "http://imdb.com/originalTitle";
        rr:objectMap [
            rr:template "{originalTitle}";
            rr:termType rr:Literal;
            rr:datatype xsd:string;
        ]
    ];
    rr:predicateObjectMap [
        rr:predicate "http://imdb.com/isAdult";
        rr:objectMap [
            rr:template "{isAdult}";
            rr:termType rr:Literal;
            rr:datatype xsd:string;
        ]
    ];
    rr:predicateObjectMap [
        rr:predicate "http://imdb.com/startYear";
        rr:objectMap [
            rr:template "{startYear}";
            rr:termType rr:Literal;
            rr:datatype xsd:string;
        ]
    ];
    rr:predicateObjectMap [
        rr:predicate "http://imdb.com/runtimeMinutes";
        rr:objectMap [
            rr:template "{runtimeMinutes}";
            rr:termType rr:Literal;
            rr:datatype xsd:string;
        ]
    ];
    rr:predicateObjectMap [
        rr:predicate "http://imdb.com/genres";
        rr:objectMap [
            rr:template "{genres}";
            rr:termType rr:Literal;
            rr:datatype xsd:string;
        ]
    ];
    rr:predicateObjectMap [
        rr:predicate "http://imdb.com/titleType";
        rr:objectMap [
            rr:template "{titleType}";
            rr:termType rr:Literal;
            rr:datatype xsd:string;
        ]
    ] .

With the source data and a mapping file, we can now use the Structured File Lens to generate our RDF. The below diagram gives a basic overview of how the Structured File Lens works.

The Lens runs as a service in Amazon ECS. For security, it runs within a private subnet. A bastion host is created within a public subnet; users can SSH into this and then securely make REST requests to the ECS service. When a REST request is made to the /process endpoint, the Lens downloads the mapping file specified during the ECS setup in CloudFormation, along with the input file specified in the REST request. From these two files it generates an RDF output file, which is sent to the specified directory in the S3 output bucket. A response is then sent to the REST client specifying the location of the output file. The Lens can also be triggered by Kafka and supports Kafka streaming; documentation on this can be found here.

Running the Structured File Lens

To complete this tutorial and convert the input JSON file to RDF, it is necessary to subscribe to the Lens in AWS Marketplace. The link to follow for the Lens is here. From this product page, click Continue to Subscribe.

On the following page, click Continue to Configuration.

On the next page, click Continue to Launch.

On the final page, there is a quick stack creation link for the Lens, which you can click to start the process of creating the Lens.

Once on the quick link page, all you need to do is add an EC2 key name (used to connect to the bastion host, through which we reach the ECS service), fill out the mappings directory URL, and choose a bucket for the RDF output and provenance files to be uploaded to (these can be two separate buckets, or you can use the same one if you prefer). The mappings directory URL should be s3://data-lens-tutorials/Structured-File-Lens/fastest-way-blog/mapping/ (you do not need to specify the file name, as the Lens defaults to using a file named mapping.ttl).

N.B.

  • Make sure you run the Lens in the same region as the buckets you are trying to access. In this example, the Lens and input bucket are in AWS region us-east-1, so the output buckets you use also need to be in us-east-1.
  • When running the Lens in AWS, you have to use the S3 path instead of the URL. For example, the mapping directory we are using in this tutorial has an S3 path of
s3://data-lens-tutorials/Structured-File-Lens/fastest-way-blog/mapping/

and a URL of

https://data-lens-tutorials.s3.amazonaws.com/Structured-File-Lens/fastest-way-blog/mapping/

The S3 path takes the form s3:// plus the bucket and folder path. If you have a bucket called my-test-bucket containing a folder called output that you want to use, the S3 path would be s3://my-test-bucket/output (note that S3 bucket names must be lowercase).
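A small helper sketch showing the conversion between the two forms, assuming the common virtual-hosted https://&lt;bucket&gt;.s3.amazonaws.com/&lt;key&gt; URL style only:

```python
from urllib.parse import urlparse

def https_url_to_s3_path(url: str) -> str:
    """Convert a virtual-hosted-style S3 HTTPS URL to an s3:// path.

    Assumes the https://<bucket>.s3.amazonaws.com/<key> form; other S3
    URL styles (path-style, regional endpoints) are not handled here.
    """
    parsed = urlparse(url)
    bucket = parsed.netloc.split(".s3.")[0]  # the part before ".s3.amazonaws.com"
    return f"s3://{bucket}{parsed.path}"

print(https_url_to_s3_path(
    "https://data-lens-tutorials.s3.amazonaws.com/Structured-File-Lens/fastest-way-blog/mapping/"
))
# → s3://data-lens-tutorials/Structured-File-Lens/fastest-way-blog/mapping/
```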

Finally, tick the checkbox to acknowledge that AWS CloudFormation might create IAM resources and press Create Stack.

As you can see from the Parameters section, there is a ProvOutPutDirUrl. This is for the two provenance files the Lens produces. Within the Structured File Lens, time-series data is supported as standard: every time a Lens ingests some data, we add provenance information. This means that you have a full record of the data over time, allowing you to see what the state of the data was at any moment. The model we use to record provenance information is the W3C-standard PROV-O model. For more information on how the provenance is laid out, as well as how to query it from your triple store, see the Provenance Guide.

Once the CloudFormation stack creation has completed, go to the Outputs section and take note of the Output values.

As the Lens is in a private subnet, we need to SSH into the bastion host from our local terminal to be able to send REST requests to the Lens. The format for this command is as follows:

ssh -i "<private-key-file>" ec2-user@<bastion-host-dns-name>

We provided the key file name as one of our parameters, and the bastion host DNS name is provided as an Output from the stack. Putting these into the above format, we have the command:

ssh -i "cloudformation-key.pem" ec2-user@ec2-34-200-220-105.compute-1.amazonaws.com

From the bastion host we can then use curl to perform a GET request to the endpoint value, with the input file URL added as a request parameter. The input file URL to use is s3://data-lens-tutorials/Structured-File-Lens/fastest-way-blog/input/imdb-titles-100.json, e.g.:

curl --location --request GET 'http://internal-Structured-File-Lens-Only-887622255.us-east-1.elb.amazonaws.com/process?inputFileURL=s3://data-lens-tutorials/Structured-File-Lens/fastest-way-blog/input/imdb-titles-100.json'

Here I am making the request using curl and you can also see the successful output response.

The created RDF should now be available as an N-Quads file in the output location you specified on the CloudFormation quick stack creation page. Its content should look like this:

<http://imdb.com/title/tt0000001> <http://imdb.com/genres> "Documentary,Short" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000001> <http://imdb.com/isAdult> "0" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000001> <http://imdb.com/originalTitle> "Carmencita" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000001> <http://imdb.com/primaryTitle> "Carmencita" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000001> <http://imdb.com/runtimeMinutes> "1" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000001> <http://imdb.com/startYear> "1894" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000001> <http://imdb.com/titleType> "short" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000001> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://imdb.com/Title> <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000002> <http://imdb.com/genres> "Animation,Short" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000002> <http://imdb.com/isAdult> "0" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000002> <http://imdb.com/originalTitle> "Le clown et ses chiens" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000002> <http://imdb.com/primaryTitle> "Le clown et ses chiens" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000002> <http://imdb.com/runtimeMinutes> "5" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000002> <http://imdb.com/startYear> "1892" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000002> <http://imdb.com/titleType> "short" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
<http://imdb.com/title/tt0000002> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://imdb.com/Title> <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .
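As a quick programmatic sanity check on a line of this output, here is a minimal sketch (naive whitespace splitting, so it only handles object literals without internal spaces; a real parser such as rdflib would be more robust):

```python
# One line of the N-Quads output: subject, predicate, object, graph, and
# the terminating dot. The graph IRI is the provenance graph added by the Lens.
line = ('<http://imdb.com/title/tt0000001> <http://imdb.com/primaryTitle> '
        '"Carmencita" <http://www.data-lens.co.uk/f559cb71-8ab5-4b15-aea4-d5f51f88043d> .')

# maxsplit=4 yields exactly five fields for this simple case.
subject, predicate, obj, graph, dot = line.split(" ", 4)
print(subject)  # the IRI built from the rr:template and the tconst value
print(obj)      # the literal taken from the primaryTitle key
```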

If you want to double-check the full output of your file an example of the output can be found here.

Now you know how to transform data into RDF quickly and easily using the Structured File Lens. You can launch as many instances of the Structured File Lens, or any of the other Data-Lens products, as you have requirements or data feeds for. Links to the AWS Marketplace product pages for all the Lenses are below, and full documentation for the Data-Lens products can be found here. If you have any queries, feel free to contact us through our website here.

The Structured File Lens can be accessed here.

The SQL Lens can be accessed here.

The RESTful Lens can be accessed here.

The Document Lens can be accessed here.

The Lens Writer can be accessed here.

Happy converting!
