A practical review of non-RDF to RDF converters
I believe that converting a bunch of CSV, JSON or XML files to an RDF format is a very common task for people dealing with the Semantic Web, Linked Data, RDF and the like. However, I can’t easily name a ready-to-use converter that I’d recommend, so I’ve decided to review some of them and pick the most suitable one for next time.
I found more than a dozen converters, but included in the review only those that:
- are alive and well maintained,
- I’ve managed to install and use.
As a result we have six tools. I also classified them by the functions they can perform: extract (E), transform (T) and load (L). Here are the tools (name, functions, version and license):
- Tarql — T, master@f616414, BSD 2-Clause
- RML-Mapper — T, master@cdd2464, No license
- SPARQL-Generate — T, 1.2.2, Apache License 2.0
- SETLr — TL, master@873f8da, Apache License 2.0
- Karma — TL, 2.1, Apache License 2.0
- LinkedPipes ETL — ETL, master@d446615a, MIT License
Those that support only the transform function may be useful for occasional runs or as a part of a bigger pipeline. The others may be used to build the whole pipeline.
Review: feature comparison
I’ve looked at seven features that I find important when selecting such a tool. The first three are:
- Processing mode — the input file is either loaded into memory and processed as one piece (batch), split into parts that are loaded and processed separately (stream), or processed on a cluster, as a whole or in parts (cluster).
- Input formats — in this review I’m considering only file-based formats, so you won’t see RDB-to-RDF conversion and the like.
- Sources — where the input file may be loaded from: a local path, a URL, etc.
Here is the table with information about these three features:
Most of the popular file formats are supported, except by the tools that specialize in column-based formats. Sadly, stream-based processing isn’t well supported, so you’ll have to keep an eye on the file sizes.
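To make the difference between the batch and stream modes more concrete, here is a minimal Python sketch. It isn’t taken from any of the reviewed tools: the column names, the example.org URIs and the `to_triple` helper are all hypothetical and serve only as an illustration.

```python
import csv
import io


def rows_batch(text):
    """Batch mode: parse the whole input and keep every row in memory at once."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_stream(lines):
    """Stream mode: yield one row at a time, so only the current row is held."""
    for row in csv.DictReader(lines):
        yield row


def to_triple(row):
    # Hypothetical mapping: one N-Triples line per row, with made-up URIs.
    return '<http://example.org/contract/%s> <http://example.org/number> "%s" .' % (
        row["id"],
        row["number"],
    )


sample = "id,number\n1,A-100\n2,B-200\n"

# Both modes produce the same triples; they differ only in peak memory use.
batch_triples = [to_triple(row) for row in rows_batch(sample)]
stream_triples = [to_triple(row) for row in rows_stream(io.StringIO(sample))]
```

With a real file you would pass an open file object to `rows_stream` instead of `io.StringIO`, so the input is never read into memory as a whole.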
The other four features are:
- Mapping language — a declarative language which describes extraction, transformation and loading (aka ETL steps).
- Graphical editor — a UI that lets you define the ETL steps visually.
- Joins — a mechanism to define a join function between two or more input files and run the transformation on top of the resulting data.
- Custom functions — the ability to implement functions that aren’t built into the standard distribution of the tool. E.g. the Jena ARQ engine lets you implement a custom SPARQL function as a Java class and include it in the distribution.
So the tools that don’t have a graphical editor support a declarative mapping language, and the rest provide a UI instead. Joining is a powerful mechanism that is especially useful when it’s not possible to create stable URIs for the entities that link data from different input files. Unfortunately, only two tools support it: RML-Mapper and SPARQL-Generate.
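Here is a rough sketch of why joins matter, written in plain Python rather than in any of the mapping languages. I’m assuming two hypothetical CSV inputs: contracts that point to customers by a local key, and customers that carry the identifier needed to build the URI. All column names, values and URIs are made up.

```python
import csv
import io

# Hypothetical inputs: the contract file knows only a local customer key,
# while the identifier needed for a stable URI lives in the customer file.
contracts_csv = "contract_id,customer_key\nc1,k1\nc2,k2\nc3,k1\n"
customers_csv = "customer_key,ogrn\nk1,1027700000001\nk2,1027700000002\n"


def join_rows(left, right, key):
    """Inner hash join of two lists of row dicts on a shared column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            merged = dict(right_row)
            merged.update(left_row)
            yield merged


contracts = list(csv.DictReader(io.StringIO(contracts_csv)))
customers = list(csv.DictReader(io.StringIO(customers_csv)))
joined = list(join_rows(contracts, customers, "customer_key"))

# After the join, every row has both values needed to link the two entities.
triples = [
    "<http://example.org/contract/%s> <http://example.org/customer> <http://example.org/org/%s> ."
    % (row["contract_id"], row["ogrn"])
    for row in joined
]
```

Without the join, the contract rows alone wouldn’t contain enough information to build the customer URI.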
Review: communities comparison
The communities are compared by their activity on GitHub, since the source code of all the tools is published there. The meaning of the columns should be clear to everyone, so I’ll skip their description :)
As you can see, Karma stands out among the others: it had far more contributors, forks, stars, etc. than all the other tools combined. These numbers may be partially explained by the fact that it’s the oldest of the reviewed tools: the first commit to its repo dates back to Sep 23, 2011.
Review: performance comparison
For this part I took a dataset published by ClearSpending.ru that contains public contracts from the Russian Tender System. From all the archives I selected the one with contracts published in March 2018; it contains a single big JSON file (~560 MB). To make the comparison more complete, I converted the original JSON file to CSV and XML, and for each format generated files with 100, 500, 1000, 5000, 10000 and 100000 contracts.
To convert JSON to XML I adapted the json2xml library. It ships without a ready-to-use .jar file and doesn’t support the Cyrillic encoding, but it was relatively easy to extend with the needed functionality.
The json2csv converter was used to produce the CSV files. Fields with large arrays were dropped from the original JSON before converting it to CSV; otherwise the CSV could contain hundreds of thousands of columns, since json2csv maps each array element to a separate column.
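The column explosion is easy to reproduce with a toy flattener. The sketch below is not json2csv’s actual implementation, just a simplified stand-in that names columns by their path, and the `products` field is a hypothetical example of an array field.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a flat dict of column name -> value."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # Each array element gets its own index in the column path.
        items = ((str(i), item) for i, item in enumerate(value))
    else:
        return {prefix: value}
    columns = {}
    for key, item in items:
        name = prefix + "." + key if prefix else key
        columns.update(flatten(item, name))
    return columns


# Hypothetical record: three array elements turn into three separate columns.
record = {"id": "c1", "products": [{"name": "a"}, {"name": "b"}, {"name": "c"}]}
wide = flatten(record)

# Dropping the array fields first keeps the number of columns fixed.
narrow = flatten({k: v for k, v in record.items() if not isinstance(v, list)})
```

Here `wide` ends up with one `id` column plus one `products.N.name` column per array element, so the column count grows with the longest array in the file, while `narrow` keeps only `id`.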
Here are the file sizes for each format (JSON, CSV, XML respectively):
- 100 — 556 KB, 303 KB, 653 KB
- 500 — 2.4 MB, 1.4 MB, 2.9 MB
- 1000 — 4.5 MB, 2.6 MB, 5.4 MB
- 5000 — 21 MB, 13 MB, 26 MB
- 10000 — 40 MB, 24 MB, 49 MB
- 100000 — 412 MB, 214 MB, 497 MB
Of all the data, only five fields were used to generate the RDF data: id, number, contractCreateDate, customer.OGRN, suppliers..ogrn. An example of these fields in the JSON files (the rest of the fields aren’t shown):
And the output RDF data looks like this:
In the table below you can find links to the mapping files for each tool:
- Tarql — CSV
- RML-Mapper — JSON, CSV, XML
- SPARQL-Generate — JSON, CSV, XML
- SETLr — CSV
- Karma — JSON, CSV, XML
- LinkedPipes — JSON, CSV, XML
When the data and mappings were ready, the tools were run sequentially on a machine (in Google Cloud) with the following parameters:
- 2 vCPUs & 7.5 GB memory
- OpenJDK 1.8.0_171 & JVM options: -Xmx6G
- Python 2.7
The following charts present the results of the performance comparison. If there is no result on a chart for a particular tool, it means that either the tool doesn’t support the format or it crashed with an out-of-memory error. The execution times were measured with the time tool.
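If you’d rather collect the timings from a script than from the time tool, something like the following Python sketch could work. It isn’t what I used for the charts, and the command is only a placeholder: a real run would invoke one of the converters with its mapping and input files.

```python
import subprocess
import sys
import time


def timed_run(command):
    """Run a command and return (exit code, wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode, time.perf_counter() - start


# Placeholder command that does nothing; replace it with a converter call.
code, seconds = timed_run([sys.executable, "-c", "pass"])
```

A non-zero exit code would then be the signal to leave that tool off the chart for the given file, e.g. when the process crashed.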
So what do we see on these charts? Let’s first look at the tools that support only the CSV format.
- Tarql performed very well: it successfully converted all the CSV files and was the fastest on the large files.
- SETLr wasn’t able to convert the largest CSV file because it failed with an out-of-memory error, but in the other cases its results were similar to Tarql’s.
And now the tools which support all three formats:
- RML-Mapper has the worst results, except on the smallest files, where it came second to last with all three formats. But it’s the only tool that was able to process all the files.
- SPARQL-Generate was the fastest with the JSON files, but it has fairly average results with the other formats and even failed to process the largest CSV and XML files.
- Karma is quite similar to SPARQL-Generate: it has average results and wasn’t able to finish processing the largest files.
- LinkedPipes was the fastest with the CSV and XML files, but has some of the worst results with the JSON files.
To summarise, I’d call LinkedPipes the fastest tool in terms of execution time, if it hadn’t failed to process the largest JSON file and didn’t have considerably worse results with JSON in general.
Which one of them is the best?
Ahh, when I started working on this review, I naively thought that I’d definitely find the best tool for my tasks… But unfortunately I haven’t found such a tool :(. Each of them has its own limitations and features that I don’t really like. Anyway, let me summarise my thoughts about each of the tools.
As you may guess, I’m skipping Tarql and SETLr here, because they support only column-based formats and can’t be adapted to work with other formats. But if I worked only with CSV, I’d use Tarql because of its performance and its use of SPARQL as a mapping language.
Karma is quite similar to OpenRefine in terms of UI and usage patterns. It has performance problems, but it looks like it supports the MapReduce approach, so they may be mitigated. The main limitations for me are that it can’t easily be integrated with a VCS and that the mappings can’t be written without the UI.
RML-Mapper has poor performance results, so it’s not suitable for real-world datasets, although it’s the only tool that successfully processed all the files.
LinkedPipes and SPARQL-Generate aren’t directly comparable, because the former is a full ETL tool, while the latter can only transform data and can’t extract it from sources or load it into targets.
Although I said above that I couldn’t identify a clear winner, as a result of this review I found that I’d like to try SPARQL-Generate as my next non-RDF-to-RDF converter. There are several reasons for that:
- its mapping language is based on SPARQL, so it’s easy to get started;
- the mappings can be edited in a text editor;
- it has an extensible architecture, so I can implement, e.g., streaming support myself;
- it can be embedded in more complex pipelines, such as those built on top of Apache Beam and the like.