OGI at the ‘Department of environment and spatial planning’ of the Flemish Government
The Department of environment and spatial planning of the government of Flanders is responsible for preparing, following up and evaluating the Flemish environmental policy. Companies that want to emit polluting substances into the air or water must obtain a permit. Once they hold a permit, they must report annually on their emissions of the previous year. Because of this obligation, data on the amount of emitted substances per location have been collected since 2004.
Aim of the project
The aim of the project is to integrate these data with all kinds of other data sources (internal and external) to offer:
- the general public an application showing the emitted substances over the years in the area that they live in
- companies the ability to benchmark themselves against other, similar, organisations
- public servants analytics dashboards to gain insights to steer the policy making.
The ultimate aim is that, by integrating the data, (research) questions can be addressed that cannot be answered using the separate data silos.
Preferably, the solution had to be based on open source software or open source libraries.
Data integration needed
More than 10 data silos needed to be integrated. They comprise:
- data managed by the department itself (archival systems, RDBMS)
- datasets published by other departments (Flemish addresses database, Belgian company register), accessible via external SPARQL endpoints
- well-known classification systems such as NACE (economic activities) and NIS (administrative geographical entities), also available as external SPARQL endpoints
The most important datasets are in fact observations. Hence our use of the RDF Data Cube vocabulary: a vocabulary explicitly made to capture statistical data to allow OLAP operations such as slice and dice, roll-up and drill-down. The code lists are encoded in XKOS, an extension of SKOS (Simple Knowledge Organization System) for statistical classifications.
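To make the OLAP vocabulary above concrete, here is a minimal Python sketch of slice and roll-up over a handful of observations. It is only an illustration: the dimension names and all emission values are made up, and plain dictionaries stand in for what the project stores as RDF Data Cube observations.

```python
from collections import defaultdict

# Made-up observations: each one fixes a value for every dimension
# (municipality, province, substance, year) and carries one measure (kg).
OBSERVATIONS = [
    {"municipality": "Gent", "province": "Oost-Vlaanderen", "substance": "NOx", "year": 2015, "kg": 120.0},
    {"municipality": "Aalst", "province": "Oost-Vlaanderen", "substance": "NOx", "year": 2015, "kg": 80.0},
    {"municipality": "Gent", "province": "Oost-Vlaanderen", "substance": "SO2", "year": 2015, "kg": 30.0},
    {"municipality": "Brugge", "province": "West-Vlaanderen", "substance": "NOx", "year": 2015, "kg": 50.0},
    {"municipality": "Gent", "province": "Oost-Vlaanderen", "substance": "NOx", "year": 2016, "kg": 100.0},
]


def slice_cube(observations, **fixed):
    """Slice: keep only the observations whose dimensions match the fixed values."""
    return [o for o in observations if all(o[k] == v for k, v in fixed.items())]


def roll_up(observations, level, measure="kg"):
    """Roll-up: sum the measure over all observations sharing a value at `level`."""
    totals = defaultdict(float)
    for o in observations:
        totals[o[level]] += o[measure]
    return dict(totals)


# Fix substance and year (slice), then aggregate at two geographical levels.
nox_2015 = slice_cube(OBSERVATIONS, substance="NOx", year=2015)
per_municipality = roll_up(nox_2015, "municipality")
per_province = roll_up(nox_2015, "province")
```

Drill-down is the reverse move: going from `per_province` back to `per_municipality` for one province.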
The solution tries to avoid:
- replicating external datasets in the own repository
- ETL (Extract, Transform, Load) processes.
The first principle is somewhat controversial: external endpoints are known to be unreliable, and an architecture is only as solid as its weakest point. See Verborgh Ruben, Can I SPARQL an endpoint? and Beek Wouter, Rietveld Laurens et al, Why the Semantic Web Needs Centralization (Even If We Don’t Like It).
However, in our case the requests going out are just very simple queries where the external endpoints are serving as lookup ‘tables’. If an external endpoint goes down it will not be our fault. And we still dream of the web as a distributed database.
We use a main triple store, in which we manage the results of our ETL processes. Our main sources are 2 archival systems (DSpace) containing XML reports on the observed emissions. These reports are extracted and transformed to the RDF Data Cube format using XSLT. They are accompanied by many controlled vocabularies, converted to SKOS/XKOS by various means (TARQL, XSLT, OpenRefine …).
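As a rough illustration of that transformation step, the sketch below turns one XML report into Data Cube observations in Turtle. It swaps in Python for the XSLT the project actually uses, and it rests on assumptions: the report layout, the `ex:` namespace and the property names are hypothetical, since the real DSpace reports are not shown here. Only `qb:Observation` and `qb:dataSet` come from the RDF Data Cube vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout, invented for this sketch.
REPORT_XML = """
<report year="2015" site="site-001">
  <emission substance="NOx" unit="kg">120.5</emission>
  <emission substance="SO2" unit="kg">30</emission>
</report>
"""

PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
    "@prefix ex: <http://example.org/emissions/> .\n"  # hypothetical namespace
)


def report_to_turtle(xml_text):
    """Emit one qb:Observation per <emission> element of the report."""
    root = ET.fromstring(xml_text.strip())
    year, site = root.get("year"), root.get("site")
    blocks = [PREFIXES]
    for emission in root.findall("emission"):
        substance = emission.get("substance")
        amount = float(emission.text)
        blocks.append(
            f"ex:obs-{site}-{year}-{substance} a qb:Observation ;\n"
            f"    qb:dataSet ex:dataset ;\n"
            f"    ex:site ex:{site} ;\n"
            f'    ex:year "{year}"^^xsd:gYear ;\n'
            f"    ex:substance ex:{substance} ;\n"
            f'    ex:amountKg "{amount}"^^xsd:decimal .\n'
        )
    return "\n".join(blocks)


turtle = report_to_turtle(REPORT_XML)
```

Each observation gets a URI built from its dimension values, so re-running the conversion on the same report yields the same identifiers.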
In addition, we look up information in 3 external endpoints (CRAB, KBO, fedStats).
Most of the requested reports require querying more than just the main endpoint.
For example, one of the requested reports calculates the total emission of a certain substance per administrative geographical level (municipality, province, region).
The related SPARQL query:
One can see that this query goes out to 3 endpoints: our main triple store, the virtualised endpoint above our RDBMS and an external one publishing some official thesauri.
With this as a result.
We found, however, that this type of federated query is not straightforward in some triple stores/SPARQL endpoints, a subject we will elaborate on in a separate post.
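The federated pattern described in this section, one query answered by the main triple store plus two other endpoints reached through SERVICE clauses, can be sketched as follows. This is not the project's actual query: the endpoint URLs, the `ex:` properties and the assumption that the thesaurus links municipalities to provinces and regions via `skos:broader` are all hypothetical. The Python function only assembles the SPARQL text; sending it to the main endpoint is left out.

```python
def build_rollup_query(substance_uri, rdbms_endpoint, thesaurus_endpoint):
    """Build a federated SPARQL query totalling one substance per area.

    All endpoint URLs and ex: property names are placeholders for this sketch.
    """
    return f"""
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ex: <http://example.org/emissions/>

SELECT ?area ?areaLabel (SUM(?amount) AS ?total)
WHERE {{
  # Main triple store: the emission observations.
  ?obs a qb:Observation ;
       ex:substance <{substance_uri}> ;
       ex:site ?site ;
       ex:amountKg ?amount .

  # Virtualised endpoint on top of the RDBMS: where each site is located.
  SERVICE <{rdbms_endpoint}> {{
    ?site ex:municipality ?municipality .
  }}

  # External thesaurus: the municipality itself and every broader area.
  SERVICE <{thesaurus_endpoint}> {{
    ?municipality skos:broader* ?area .
    ?area skos:prefLabel ?areaLabel .
  }}
}}
GROUP BY ?area ?areaLabel
"""


query = build_rollup_query(
    "http://example.org/substance/NOx",
    "http://example.org/rdbms/sparql",
    "http://example.org/thesauri/sparql",
)
```

The `skos:broader*` path matches zero or more steps, so a single query returns totals at municipality, province and region level.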
Applications using the infrastructure
Now that we have the infrastructure up and running, apps are being built to leverage the integrated data.
One example is an application that allows citizens to investigate air emissions in their neighbourhood over time.
All observations and other managed entities are also published with dereferenceable URLs.
An example of an observation according to the RDF data cube vocabulary:
This LOD publishing uses the NetKernel Linked Open Data edition.
Data exploration, visualisation
Within the project we developed a SPARQL connector for the exploratory.io data science tool, which allows us to take SPARQL results as input for our analyses.
Currently, someone who wants to query the system needs to know in advance which endpoint serves which information. We hope to get rid of the many SERVICE clauses in our queries by implementing federation systems such as FedX or Semagrow.
We will of course implement real data cube functionalities developed by our consortium R&D partners.
Using semantic technologies we were able to integrate several different datasets and expose those as one integrated queryable set.
Doing this, 5 star linked open data publishing came right out of the NetKernel box.
We are confident we now have the foundations to start building all kinds of interesting (data cube related) applications, with the aim of making policy making more data driven.